Abstract: We consider a setting in which two nodes (referred
to as forwarders) compete to choose a relay node from a set
of relays, as they ephemerally become available (e.g., wake up
from a sleep state). Each relay, when it arrives, offers a (possibly
different) “reward” to each forwarder. Each forwarder’s
objective is to minimize a combination of the delay incurred in
choosing a relay and the reward offered by the chosen relay.
As an example, we develop the reward structure for the specific
problem of geographical forwarding over a network of sleep-wake
cycling relays.

We study two variants of the generic relay selection problem, 
namely, the completely observable (CO) case where, when a 
relay arrives, both forwarders get to observe both rewards, and 
the partially observable (PO) case where each forwarder can 
only observe its own reward. Formulating the problem as a
two-person stochastic game, we characterize the solution in terms of Nash
Equilibrium Policy Pairs (NEPPs). For the CO case we provide
a general structure of the NEPPs. For the PO case we prove that 
there exists an NEPP within the class of threshold policy pairs. 

We then consider the particular application of geographical 
forwarding of packets in a shared network of sleep-wake cycling 
wireless relays. For this problem, for a particular reward structure,
using realistic parameter values corresponding to the TelosB
wireless mote, we numerically compare the performance (in terms
of cost to both forwarders) of the various NEPPs and draw the 
following key insight: even for moderate separation between the 
two forwarders, the performance of the various NEPPs is close 
to the performance of a simple strategy where each forwarder 
behaves as if the other forwarder is not present. We also conduct 
simulation experiments to study the end-to-end performance of 
the simple forwarding policy. 

Index Terms: Competitive relay selection, geographical forwarding,
stochastic games, Bayesian games.

I. Introduction 

We are concerned in this paper with a class of resource
allocation problems in wireless networks, in which competing
nodes need to acquire a resource, such as a physical
radio relay (see the geographical forwarding example later
in Section III) or a channel (as in a cognitive radio network
[?]), when a sequence of such resources “arrive” over
time, and are available only fleetingly for acquisition. In
this paper, formulating such a problem for two nodes as a
stochastic game, we consider the completely observable and
partially observable cases, and provide characterizations of the
Nash Equilibrium Policy Pairs (NEPPs). We provide numerical
results, and insights therefrom, for a specific reward structure
derived from the problem of geographical forwarding in sleep-wake
cycling networks.

The Geographical Forwarding Context: With the increasing
importance of “smart” utilization of our limited resources
(e.g., energy and clean water) there is a need for instrumenting
our buildings and campuses with wireless sensor networks.
As awareness grows and sensing technologies emerge, new
applications will be implemented. While each application will
require different sensors and back-end analytics, the availability
of a common wireless network infrastructure will promote
the quick deployment of new applications. One approach for
building such an infrastructure, say, in a large building setting,
would be to deploy a large number of relay nodes, and
employ the idea of geographical forwarding. If the phenomena
to be monitored are slowly varying over time, the traffic
on the network can be assumed to be light. In addition,
such applications are delay tolerant, thus accommodating the
approach of opportunistic geographical forwarding over sleep-wake
cycling networks [?], [?].

Sleep-wake cycling is an approach whereby, to conserve
relay battery power, the relays keep their radios turned OFF,
turning them ON periodically to provide opportunities for
packet forwarding. The problem of forwarding in such a
setting was explored in [3], [4], where the formulation was
limited to a single alarm packet flowing through the network.
Whereas the emphasis in [3] was to develop an end-to-end
optimal forwarding algorithm, thus requiring a global
organization step, in [4], which is our prior work on this
problem, we sought a locally-optimal forwarding heuristic.
End-to-end forwarding was achieved by applying the local
heuristic at each forwarding step. We found that, over a certain
range of operation, the performance obtained by the heuristic
is comparable with the optimal solution provided by [3].

In the setting discussed above, even though the traffic is
light, there is still a chance that more than one
forwarder is seeking a relay from among a set of potential relays.
There then arises the problem of assigning the relays, as they
wake up, to one or the other of the forwarders. This is thus
an extension of the local forwarding problem discussed in
[4]. Formally, the local forwarding problem we consider in
this paper is the following. There are two forwarders, each of
which has to choose a relay node to forward its packet to.
The relays wake up sequentially over time. Whenever a
relay wakes up, each forwarder first evaluates the relay based
on a reward metric (which could be a function of the progress,
towards the sink, made by the relay, and the power required
to get the packet to the relay [4]), and then decides whether to
compete (with the other forwarder) for this relay or continue
to wait for further relays to wake up. Such a geographical
forwarding setting will serve as an example application of the
stochastic game formulation developed in this paper.

Outline and Our Contributions: We will describe a general
system model in Section II, following which, in Section III, we
will discuss a geographical forwarding problem as an example.
Related work will be presented in Section IV. In Sections V
and VI we will study two variants of the problem (of progressive
complexity), namely, one where complete information
is available to both forwarders and one with only partial
information. We will use stochastic game theory to obtain
the solution in terms of (stationary) Nash Equilibrium Policy
Pairs (NEPPs). We will briefly study a cooperative setting in
Section VII and obtain the Pareto optimal performance curve,
which provides a benchmark for the NEPPs. The following
are our main technical contributions:

• For the problem with complete information we obtain
results illustrating the structure of NEPPs (Theorem 2).

• For the partial information case we prove the existence
of an NE strategy (for a certain Bayesian game) within the
class of threshold strategies (Theorem 4). This result will
enable us to construct NEPPs for this case.

• In Section VIII we provide a simulation study of the
use of our formulation in the context of geographical
forwarding. Using realistic parameters from the popular
TelosB wireless mote, we make the following interesting
observation: even for moderate separation between the
two forwarders, the performance of all the NEPPs is
close to the performance of a simple strategy where each
forwarder behaves as if it is alone.

We will finally draw our conclusions in Section IX. For
ease of presentation we have moved most of the proofs to the
Appendix.


II. System Model 

Let $\mathcal{F}_1$ and $\mathcal{F}_2$ denote the two competing nodes (i.e.,
players in game theoretic terms), referred to as the forwarders.
We will assume that there are an infinite number of relay
nodes (or resources in general) that arrive sequentially
at times $\{W_k : k \ge 0\}$, which are the points of a Poisson
process of rate $\frac{1}{\tau}$. Thus, the inter-“arrival” times between
successive relays, $U_k := W_k - W_{k-1}$, are i.i.d. (independent
and identically distributed) exponential random variables of
mean $\tau$. We will refer to the relay that arrives at the instant $W_k$
as the $k$-th relay. Further, the $k$-th relay is only ephemerally
available at the instant $W_k$.

When a relay arrives, either of the forwarders can compete
for it, thereby obtaining a reward. Let $R_{p,k}$, $p = 1,2$, denote
the reward offered by the $k$-th relay to $\mathcal{F}_p$ (an example
reward structure will be discussed in Section III). The rewards
$R_{p,k}$ ($p = 1,2$; $k \ge 1$) can take values from a finite set
$\mathcal{R} = \{r_1, r_2, \cdots, r_n\}$, where $r_1 = -\infty$ and $r_i < r_j$ for
$i < j$. The reward pairs $(R_{1,k}, R_{2,k})$ are i.i.d. across $k$,
with their common joint p.m.f. (probability mass function)
being $p_{R_1,R_2}(\cdot,\cdot)$. For notational simplicity we will denote
$p_{R_1,R_2}(r_i, r_j)$ as simply $p_{i,j}$. Further, let $p^{(1)}$ and $p^{(2)}$ denote
the marginal p.m.f.s of $R_{1,k}$ and $R_{2,k}$, respectively. Thus,
$p^{(1)}_i = \sum_{j=1}^{n} p_{i,j}$ and $p^{(2)}_j = \sum_{i=1}^{n} p_{i,j}$.
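As a small illustration of the marginalization just described (each marginal p.m.f. is obtained by summing the joint p.m.f. over the other forwarder's reward index), the following Python sketch computes both marginals from a joint p.m.f. stored as a nested list. The function name `marginals` and the numerical values are hypothetical and serve only as an example.

```python
def marginals(p_joint):
    """Return (p1, p2), the marginal p.m.f.s of a joint p.m.f. p_joint[i][j]."""
    n_rows = len(p_joint)
    n_cols = len(p_joint[0])
    # p1[i] = sum over j of p[i][j]  (marginal of the reward to forwarder 1)
    p1 = [sum(p_joint[i][j] for j in range(n_cols)) for i in range(n_rows)]
    # p2[j] = sum over i of p[i][j]  (marginal of the reward to forwarder 2)
    p2 = [sum(p_joint[i][j] for i in range(n_rows)) for j in range(n_cols)]
    return p1, p2


# Hypothetical joint p.m.f. on a two-point reward set (illustrative values only).
example_joint = [[0.1, 0.2],
                 [0.3, 0.4]]
p1, p2 = marginals(example_joint)
```

Both returned lists sum to one whenever the joint p.m.f. does.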

Actions and Consequences: First we will study (in Section V)
a completely observable case where the reward pair
$(R_{1,k}, R_{2,k})$ is revealed to both the forwarders. Later, in
Section VI, we will consider a more involved (albeit more
practical) partially observable case where only $R_{1,k}$ is revealed
to $\mathcal{F}_1$, and $R_{2,k}$ is revealed to $\mathcal{F}_2$. However, in either
case, each time a relay arrives, the two forwarders have to
independently choose one of the following actions:

• S: stop and forward the packet to the current relay, or 

• C: continue to wait for further relays to arrive. 

Suppose both forwarders choose to stop. Then with probability
(w.p.) $\nu_1$, $\mathcal{F}_1$ gets the relay, in which case $\mathcal{F}_2$ has to continue
alone, while with the remaining probability ($\nu_2 = 1 - \nu_1$)
$\mathcal{F}_2$ gets the relay and $\mathcal{F}_1$ continues alone. $\nu_p$ ($p = 1,2$)
could be thought of as the probability that $\mathcal{F}_p$ will win the
contention when both forwarders attempt simultaneously. For
mathematical tractability we will assume that the forwarders 
make their decision instantaneously at the relay arrival instants. 
Further, if a relay is not chosen by either forwarder (i.e., if 
both forwarders choose to continue) we will assume that the 
relay disappears and is not available for further use. 

System State and Forwarding Policy: For the CO case,
$(R_{1,k}, R_{2,k})$ can be regarded as the state of the system at stage
$k$, provided neither forwarder has terminated (i.e., chosen
a relay) yet. When one of the forwarders, say $\mathcal{F}_2$, terminates,
we will represent the system state as $(R_{1,k}, t)$. Similarly, let
$(t, R_{2,k})$ and $(t, t)$ represent the state of the system when only
$\mathcal{F}_1$ has terminated and when both forwarders have terminated,
respectively. Formally, we can define the state space to be

$\mathcal{X} = \{(r_i, r_j), (r_i, t), (t, r_j), (t, t) : r_i, r_j \in \mathcal{R}\}$. (1)

Given a discrete set $S$, let $\Delta(S)$ denote the set of all p.m.f.s
on $S$. We now have the following definition.

Definition 1: A forwarding policy $\pi$ is a mapping, $\pi : \mathcal{X} \to
\Delta(\{s, c\})$, such that $\mathcal{F}_1$ (or $\mathcal{F}_2$) using $\pi$ will choose action
S or C according to the p.m.f. $\pi(x_k)$ when the state of the
system at stage $k \ge 1$ is $x_k \in \mathcal{X}$. A policy pair $(\pi_1, \pi_2)$ is a
tuple of policies such that $\mathcal{F}_1$ uses $\pi_1$ and $\mathcal{F}_2$ uses $\pi_2$.

Note that we have restricted attention to the class of stationary
policies only. We will denote this class of policies as $\Pi_S$.

Problem Formulation: Suppose the forwarders use a policy
pair $(\pi_1, \pi_2)$, and let $x \in \mathcal{X}$ be the state of the system at stage
1. Let $K_p$, $p = 1,2$, denote the (random) stage at which $\mathcal{F}_p$
forwards its packet. Then, the delay incurred by $\mathcal{F}_p$ ($p = 1,2$),
starting from the instant $W_1 = U_1$ (first relay’s arrival instant),
is $D_{K_p} = U_2 + \cdots + U_{K_p}$, and the reward accrued is $R_{p,K_p}$.
Let $\mathbb{E}^x_{\pi_1,\pi_2}[\cdot]$ denote the expectation operator corresponding to
the probability law, $\mathbb{P}^x_{\pi_1,\pi_2}$, governing the system dynamics
when the policy pair used is $(\pi_1, \pi_2)$ and the initial state is $x$.
Then, the expected total cost incurred by $\mathcal{F}_p$ is

$J^{(p)}_{\pi_1,\pi_2}(x) = \mathbb{E}^x_{\pi_1,\pi_2}\left[ D_{K_p} - \eta_p R_{p,K_p} \right]$ (2)

where $\eta_p > 0$ is the multiplier used to trade off between delay
and reward.

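Definition 2: We say that a policy pair $(\pi_1^*, \pi_2^*)$ is a Nash
equilibrium policy pair (NEPP) if, for all $x \in \mathcal{X}$,
$J^{(1)}_{\pi_1^*,\pi_2^*}(x) \le J^{(1)}_{\pi_1,\pi_2^*}(x)$ for any policy $\pi_1 \in \Pi_S$, and
$J^{(2)}_{\pi_1^*,\pi_2^*}(x) \le J^{(2)}_{\pi_1^*,\pi_2}(x)$ for any policy $\pi_2 \in \Pi_S$.
Thus, a unilateral deviation from an
NEPP is beneficial neither for $\mathcal{F}_1$ nor for $\mathcal{F}_2$.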

Our objective will be to characterize the solution in terms 
of NEPPs. 


III. Geographical Forwarding Example 

Before proceeding further, in this section, as a motivating 
example, we will construct a reward structure corresponding 
to the problem of geographical forwarding in sleep-wake 
cycling wireless networks. Let $\mathcal{F}_1$ and $\mathcal{F}_2$ actually represent
two forwarding nodes in a wireless network. As shown in
Fig. 1, let $v_1$ and $v_2$ denote their respective locations. A sink
node is located at $v_0$. Let $d$ denote the range of both the
forwarders. Given any location $\ell \in \mathbb{R}^2$, we define the progress,
$Z_p(\ell)$, made by location $\ell$ with respect to (w.r.t.) $\mathcal{F}_p$ as

$Z_p(\ell) = \| v_p - v_0 \| - \| \ell - v_0 \|$ (3)

where $\| \cdot \|$ denotes the Euclidean norm. Thus, $Z_p(\ell)$ is simply
the difference between the $\mathcal{F}_p$-to-sink and $\ell$-to-sink distances. A
positive value of $Z_p(\ell)$ implies that location $\ell$ is closer to
the sink than $\mathcal{F}_p$. Now, define the forwarding region, $\mathcal{L}_p$, of
$\mathcal{F}_p$ as the set of all locations that lie within the range of
$\mathcal{F}_p$ and make non-negative progress w.r.t. $\mathcal{F}_p$, i.e., denoting
$D_p(\ell) = \| \ell - v_p \|$ to be the distance between $\ell$ and $\mathcal{F}_p$,

$\mathcal{L}_p = \{ \ell : D_p(\ell) \le d,\ Z_p(\ell) \ge 0 \}$. (4)

Let $\mathcal{L} = \mathcal{L}_1 \cup \mathcal{L}_2$ denote the combined forwarding region of
the two forwarders. As depicted in Fig. 1, we will discretize $\mathcal{L}$
into a grid of a finite set of $m$ locations $\{\ell_1, \ell_2, \cdots, \ell_m\}$. Thus,
from here on, whenever we refer to a location $\ell$ we mean it
to be one of the above $m$ locations.
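The progress and forwarding-region definitions above (progress is the forwarder-to-sink distance minus the location-to-sink distance; a location is in the forwarding region if it is within range $d$ and makes non-negative progress) can be sketched in a few lines of Python. The coordinates, the range value, and the integer grid below are hypothetical and are not the discretization used in the paper.

```python
import math


def progress(v_p, v_0, loc):
    """Progress Z_p(loc): forwarder-to-sink distance minus loc-to-sink distance."""
    return math.dist(v_p, v_0) - math.dist(loc, v_0)


def in_forwarding_region(v_p, v_0, loc, d):
    """True if loc is within range d of the forwarder and makes non-negative progress."""
    return math.dist(loc, v_p) <= d and progress(v_p, v_0, loc) >= 0


# Hypothetical geometry: sink at the origin, forwarder at (10, 0), range 3.
v_0, v_1, d = (0.0, 0.0), (10.0, 0.0), 3.0
# Hypothetical integer grid of candidate relay locations around the forwarder.
grid = [(x, y) for x in range(6, 14) for y in range(-3, 4)]
region_1 = [loc for loc in grid if in_forwarding_region(v_1, v_0, loc, d)]
```

The combined region of two forwarders would then be the union of the two filtered lists.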

Sleep-Wake Process: Without loss of generality, we will 
assume that at time 0 each forwarder is holding an alarm 
packet which has to be forwarded to a downstream relay node
(i.e., a node in its forwarding region). Since the relays are 
sleep-wake cycling, each forwarder has to wait until a “good” 
relay wakes up (the goodness of a relay will be based on the 
reward metric to be discussed in this section). 

Fig. 1. One-hop forwarding scenario: $v_0$, $v_1$, and $v_2$ are the locations of the
sink, $\mathcal{F}_1$, and $\mathcal{F}_2$, respectively; $d$ is the range of each forwarder. Possible
relay locations are shown as ∘.

A practical approach for sleep-wake cycling is the asynchronous
periodic sleep-wake process [3], [?], where each
relay $i$ wakes up at the periodic instants $\{T_i + kT : k \ge 0\}$
with $\{T_i\}$ being i.i.d. uniform on $[0, T]$ ($T$ is referred to as
the sleep-wake cycling period). Now, for dense networks where the
number of relays $N$ is large, if $T$ scales
with $N$ such that $\frac{N}{T} = \frac{1}{\tau}$ as $N \to \infty$, then the aggregate
point process of relay wake-up instants converges to a Poisson
process of rate $\frac{1}{\tau}$ [7]. This observation motivates us to model
the aggregate point process of wake-up instants of relays as a
Poisson point process. Furthermore, the Poisson point process
assumption renders our problem analytically tractable, leading
to interesting structural results.
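The Poisson approximation can be checked informally with a short simulation. The Python sketch below draws i.i.d. uniform wake-up offsets for a large number of relays over one period, with the period scaled so that the aggregate wake-up rate is $1/\tau$, and then looks at the gaps between consecutive wake-up instants. For a Poisson process these gaps would be exponential with mean $\tau$, so their mean should be close to $\tau$ and their coefficient of variation close to 1. The function name, the seed, and the parameter values are hypothetical, and the check is only a sanity test within a single period, not a proof of convergence.

```python
import random
import statistics


def wakeup_gaps(num_relays, tau, seed=0):
    """Gaps between aggregate wake-up instants within one period T = num_relays * tau."""
    rng = random.Random(seed)
    period = num_relays * tau  # T scales with N so that N / T = 1 / tau
    offsets = sorted(rng.uniform(0.0, period) for _ in range(num_relays))
    return [b - a for a, b in zip(offsets, offsets[1:])]


gaps = wakeup_gaps(num_relays=2000, tau=0.5)
mean_gap = statistics.mean(gaps)
# For exponential gaps the coefficient of variation (std / mean) is 1.
coeff_var = statistics.pstdev(gaps) / mean_gap
```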

Thus, formally, we model the sleep-wake cycling by assuming
that there are an infinite number of relays waking up
(within the combined forwarding region $\mathcal{L}$) sequentially at the
times $\{W_k : k \ge 0\}$, which are the points of a Poisson process
of rate $\frac{1}{\tau}$ (thus, a new relay wakes up at each instant $W_k$). Let
$L_k \in \mathcal{L}$ denote the location of the $k$-th relay (i.e., the relay
waking up at the instant $W_k$). The locations $\{L_k : k \ge 1\}$ are
i.i.d. random variables with their common p.m.f. (probability
mass function) being $q$, i.e., $\mathbb{P}(L_k = \ell) = q_\ell$.

Channel Model: We will use the following standard model
to obtain the transmission power required by $\mathcal{F}_p$ to achieve an
SNR (signal to noise ratio) constraint of $\Gamma$ at the $k$-th relay:

$P_{p,k} = \frac{\Gamma N_0}{G_{p,k}} \left( \frac{D_p(L_k)}{d_{ref}} \right)^{\xi}$ (5)

where $N_0$ is the receiver noise variance, $D_p(L_k)$ is the
distance between $\mathcal{F}_p$ and the $k$-th relay whose location is
$L_k$, $G_{p,k}$ is the gain of the channel between $\mathcal{F}_p$ and the $k$-th
relay, $\xi$ is the path-loss attenuation factor, and $d_{ref}$ is the far-field
reference distance beyond which the above expression is
valid [8], [9] (our discretization of $\mathcal{L}$ is such that the distance
between $\mathcal{F}_p$ and any $\ell \in \mathcal{L}$ is more than $d_{ref}$).

We will assume that the set of channel gains $\{G_{p,k} : k \ge
1, p = 1, 2\}$ are i.i.d. taking values from a finite set $\mathcal{G}$. Also,
let $P_{max}$ denote the maximum transmit power with which the
two forwarders can transmit, i.e., if $P_{p,k} > P_{max}$ then $\mathcal{F}_p$
cannot forward its packet to the $k$-th relay. Further, we assume
that the range $d$ (recall Fig. 1) is such that if the $k$-th relay
is outside the range of $\mathcal{F}_p$ (i.e., $D_p(L_k) > d$), then for any
$G_{p,k} \in \mathcal{G}$, $P_{p,k} > P_{max}$, so that $\mathcal{F}_p$ cannot forward to a relay
outside its range. Transmitting to a relay inside its range is
possible, however, provided the channel gain is good enough
so that the power required does not exceed $P_{max}$.

¹Geographical forwarding [5], [6], also known as location based routing,
is a forwarding technique where the assumption is that each node knows its
location as well as the location of the sink node.

Relay Rewards: Finally, combining progress and power, we
will define the reward offered by the $k$-th relay to $\mathcal{F}_p$ as

$R_{p,k} = \begin{cases} \frac{Z_p(L_k)^{a}}{P_{p,k}^{(1-a)}} & \text{if } P_{p,k} \le P_{max} \\ -\infty & \text{otherwise,} \end{cases}$ (6)

where $a \in [0,1]$ is the parameter used to trade off between
progress and power. The reward is made inversely proportional
to power because it is advantageous to use low power
to get the packet across; $R_{p,k}$ is made proportional to $Z_p(L_k)$
to promote progress towards the sink while choosing a relay
for the next hop.
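To make the reward construction concrete, the sketch below evaluates the power model in (5) and the reward in (6) in Python, under the reading that the reward is progress raised to $a$ divided by power raised to $(1-a)$ when the required power is at most $P_{max}$, and $-\infty$ otherwise. The function names and all numerical values are hypothetical, and the sketch additionally assumes non-negative progress (a relay inside the forwarding region), raising an error otherwise.

```python
import math


def required_power(snr, noise_var, gain, dist, d_ref, xi):
    """Transmit power needed to meet the SNR constraint, following (5)."""
    return (snr * noise_var / gain) * (dist / d_ref) ** xi


def reward(progress, power, p_max, a):
    """Reward in (6): progress^a / power^(1 - a) if power <= p_max, else -inf."""
    if power > p_max:
        return -math.inf
    if progress < 0:
        # Assumption of this sketch: the relay lies in the forwarding region.
        raise ValueError("progress must be non-negative")
    return progress ** a / power ** (1.0 - a)


# Hypothetical numbers, chosen only to make the arithmetic easy to follow.
power = required_power(snr=2.0, noise_var=1.0, gain=1.0, dist=2.0, d_ref=1.0, xi=2.0)
```

With $a = 1$ the reward reduces to the progress alone, and with $a = 0$ it reduces to the inverse of the required power.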

We will use the above reward structure for conducting numerical
and simulation experiments in Section VIII. However,
it is important to note that all our analysis in the subsequent
sections holds for the general model introduced in Section II.


IV. Related Work 

We will first make an important comparison with our 
prior work on the topic, before proceeding to discuss general 
literature from the area of geographical forwarding in wireless 
networks. Our problem can also be considered as a variant of 
the asset selling problem studied in the operations research 
literature; we will discuss related work from this field as well. 
Finally, we survey literature from the area of stochastic games. 

Our Prior Work: The problem of relay selection, but by a
single forwarder (i.e., the non-competitive version), has been
extensively studied by us, starting from a simple model where
the number of relays is exactly known to the forwarder to
one where only a belief about the number of relays is known [4]. We have also studied a
variant with channel probing where the relay rewards are not
immediately revealed to the forwarder; instead the forwarder
can choose to learn the reward values by paying an additional
cost [10].

The basic version of our model [4, Section 6] comprises
only one forwarder and a finite number of relays $N$; however, in
the basic model we allow the forwarder to recall a previous
relay, unlike here where recalling is not allowed. For this basic
model, the solution is completely characterized by a single threshold
$\alpha$: forward to the first relay whose reward is more than $\alpha$; at
the last stage choose the best relay irrespective of its reward
value. From [4, Section 6], we further know that the value
of $\alpha$ does not depend on $N$, and hence the solution to
the version of the basic model with an infinite number of relays
should still be the same. Furthermore, in the infinite horizon model
there is no advantage in recalling the best relay since there is
no last stage. Thus, one can argue that the solution to the
infinite horizon basic relay selection model, without recall,
should also be characterized by the same threshold $\alpha$. Here,
we will formally show that this is in fact the solution for one
forwarder when the other forwarder has already terminated
(Lemma 1). However, when both the forwarders are present,
the solution is more involved (studied in Section V-B). Thus,
the competitive model studied here is a generalization of the
aforementioned version of the basic relay selection model.

Geographical Forwarding: The problem of choosing a
next-hop relay arises in the context of geographical forwarding¹.
Geographical forwarding is a forwarding
technique where the prerequisite is that the nodes know their
respective locations as well as the sink’s location. The method
of geographical forwarding was already envisioned in the 80’s
in the context of routing in packet radio networks (PRNs) [?],
[?]. One of the simplest geographical forwarding techniques is
the greedy algorithm where each node forwards to a neighbor
in its communication region which makes maximum progress
towards the sink. This greedy algorithm is referred to as the
MFR (Most Forward within Radius) routing in [?]. Akin to
MFR is the NFP (Nearest with Forward Progress) proposed
in [?], where a node with positive progress that is closest to
the transmitting node is chosen. A generalization of MFR and
NFP routing is to randomly choose any neighbor which makes
positive progress towards the sink [13].

More recently, there has been work applying geographical
forwarding to routing in sleep-wake cycling networks. For
instance, Zorzi and Rao in [?] propose an algorithm called
GeRaF (Geographical Random Forwarding) which, at each
forwarding stage, chooses the relay making the largest
progress. For a sleep-wake cycling network, Liu et al. in [?]
propose a relay selection approach as a part of CMAC, a
protocol for geographical packet forwarding. Under CMAC,
node i chooses an tq that minimizes the expected normalized 
latency (which is the average ratio of one-hop delay and 
progress). Akin to the relay selection problem is the problem 
of channel selection [?], [?] where a transmitter, given
several channels, has to choose one for its transmissions.
Analogous to rewards in our case, the transmitter’s decision
is based on the throughput the transmitter can achieve on a
channel. Links to more literature on similar work from the
context of wireless networks can be found in [?]. However, none
of these works considers a competitive scenario like ours.

Asset Selling Problem: Finally, our relay selection problem
can be considered to be equivalent to the asset selling problem,
which belongs to the class of optimal stopping problems studied in
the operations research literature (other examples of stopping
problems include the secretary problem [?], the bandit problem
[19], etc.). The basic asset selling problem [20, Section 4.4],
[21] comprises a single seller (analogous to a forwarder in our
model) and a sequence of i.i.d. offers (rewards in our case).
The seller’s objective is to choose an offer so as to maximize
a combination of the offer value and the number of previous
offers rejected. Over the years, several variants of the basic
problem have been studied. For instance, in [22], David and
Levi consider a model in which the offers arrive at the points of
a renewal process. Kang in [23] has considered a model where
a cost has to be paid to recall the previous best offer; see [23]
for further references to literature on models with uncertain
recall. Variants with an unknown offer (or reward) distribution,
or one where a parameter of the offer distribution is unknown,
have been studied in [24], [25].

Our competitive model here can be considered as a game
variant of the basic asset selling problem, where the two
forwarders are analogous to the sellers and the reward values
are analogous to the offers. Although one game variant has
been studied by Nakagami in [26], the specific cost structure in
our problem enables us to prove results such as the existence of a
Nash equilibrium policy pair within the class of threshold rules
(Theorem 4). Further, we also study a completely observable
case which is not considered in [26].

Similarly, literature is available on the game version of the
secretary problem [27], [28], but these consider the simpler
case where the reward offered by an arriving secretary (or
resource) to both players is the same. Moreover, the objective
in the secretary problem is to maximize the probability of
choosing the best secretary (resource), which is in contrast to
our setting (asset selling) which involves a trade-off between
selection delay and reward. Further, a partially observable
scenario is not studied in these works.

Stochastic Games: Stochastic games can be considered as
a generalization of Markov decision processes (MDPs), in the
sense that a stochastic game comprises multiple agents (in
contrast to a single agent in an MDP), who jointly control the
state of the system while individually incurring a cost in doing
so. Several references [29]-[34] are available on the topic,
starting from the seminal work by Shapley [35]. However,
most of these works study either discounted or average cost
objectives, unlike our problem which falls within the realm
of total-cost transient stochastic games (or stopping games
[36, Part III]). Our formulation can be alternatively thought
of as a quitting game [37]. However, we have introduced state
transitions and state dependent quitting costs which are not
considered in the model studied in [37].

In summary, to the best of our knowledge, the model 
proposed in this paper along with the structural results we 
have derived, are new contributions to the field of stopping 
games. 


V. Completely Observable (CO) Case 

For the CO model we assume that the reward pair,
$(R_{1,k}, R_{2,k})$, of the $k$-th relay is entirely revealed to both the
forwarders. Recalling the geographical forwarding example
in Section III, this case would model the scenario where
the reward is simply the progress, $Z_p(L_k)$, the relay makes
towards the sink, i.e., if $a = 1$ in (6). Thus, observing
the location $L_k$ of the $k$-th relay, both forwarders (assuming
that both a-priori know the locations $v_1$, $v_2$ and $v_0$; see the
following remark) can entirely compute $(R_{1,k}, R_{2,k})$.

Remark: Justification for knowing the locations is as follows.
All the nodes are equipped with GPS (Global Positioning
System) devices, using which each node can know its own 
location. Next, the sink being a fixed node, its location is 
already made available to all the nodes before deployment. 
Finally, each forwarder’s knowledge of the other’s location can 
be acquired when both forwarders broadcast control packets 
in response to the control packet transmitted by the first relay. 

We will now proceed to formulate the completely observable
case as a stochastic game. Using a key theorem from the
book by Filar and Vrieze on Competitive Markov Decision
Processes [29], we will characterize the structure of NEPPs.

A. Stochastic Game Formulation 

Limiting ourselves to the case of a finite set of states and finite
action sets, formally a stochastic game can be represented by
a tuple $(\mathcal{N}, \mathcal{X}, \{\mathcal{A}_p\}, T, \{g_p\})$ where

• $\mathcal{N}$ is the set of agents or players,

• $\mathcal{X}$ is the finite set of system states,

• $\mathcal{A} = \times_{p \in \mathcal{N}} \mathcal{A}_p$ is the joint-action space with $\mathcal{A}_p$ representing
the finite action set of agent $p$,

• $T : \mathcal{X} \times \mathcal{A} \to \Delta(\mathcal{X})$ (the set of all p.m.f.s on $\mathcal{X}$) is
the probability transition kernel, i.e., $T(x'|x,a)$ is the
probability that the next state is $x'$ given that the current
state is $x$ and the current joint-action is $a = (a_p : p \in \mathcal{N})$,

• $g_p : \mathcal{X} \times \mathcal{A} \to \mathbb{R}$ is the (expected) one-step-cost function
of agent $p$.

We will now identify each of these components for our
problem. The two forwarders, $\mathcal{F}_1$ and $\mathcal{F}_2$, are the players
(i.e., $\mathcal{N} = \{\mathcal{F}_1, \mathcal{F}_2\}$), and $\mathcal{X}$ in (1) is the state space. The
action sets are $\mathcal{A}_1 = \mathcal{A}_2 = \{s, c\}$.

Transition Probabilities: Recall that $p_{\cdot,\cdot}$ is the joint p.m.f.
of $(R_{1,k}, R_{2,k})$, $p^{(1)}$ and $p^{(2)}$ are the marginal p.m.f.s of $R_{1,k}$
and $R_{2,k}$, respectively, and $\nu_p$ ($p = 1,2$) is the probability
that $\mathcal{F}_p$ will win the contention if both forwarders compete.
Now, the transition probability when the current state is of the
form $x = (r_i, r_j)$ can be written as

$T(x'|x,a) = \begin{cases}
p_{i',j'} & \text{if } a = (c,c),\ x' = (r_{i'}, r_{j'}) \\
p^{(1)}_{i'} & \text{if } a = (c,s),\ x' = (r_{i'}, t) \\
p^{(2)}_{j'} & \text{if } a = (s,c),\ x' = (t, r_{j'}) \\
\nu_2 p^{(1)}_{i'} & \text{if } a = (s,s),\ x' = (r_{i'}, t) \\
\nu_1 p^{(2)}_{j'} & \text{if } a = (s,s),\ x' = (t, r_{j'}) \\
0 & \text{otherwise.}
\end{cases}$ (7)

Note that when the joint-action is $(s,s)$, $\nu_2 p^{(1)}_{i'}$ is the probability
that $\mathcal{F}_2$ gets the current relay and the reward offered by
the next relay to $\mathcal{F}_1$ is $r_{i'}$. Similarly, $\nu_1 p^{(2)}_{j'}$ is the probability
(again when the joint-action is $(s,s)$) that $\mathcal{F}_1$ gets the relay
and the reward value of the next relay to $\mathcal{F}_2$ is $r_{j'}$.

Next, when the state is of the form $x = (r_i, t)$ (i.e., $\mathcal{F}_2$ has
already terminated) the transition probabilities depend only on
the action $a_1$ of $\mathcal{F}_1$ and are given by

$T(x'|x,a) = \begin{cases}
p^{(1)}_{i'} & \text{if } a_1 = c,\ x' = (r_{i'}, t) \\
1 & \text{if } a_1 = s,\ x' = (t,t) \\
0 & \text{otherwise.}
\end{cases}$ (8)

Similarly one can write the expression for $T(x'|x,a)$ when
the state is $x = (t, r_j)$. Finally, the state $(t,t)$ is absorbing so
that $T((t,t)|(t,t), a) = 1$.
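The transition structure out of a state in which both forwarders are still present can be sketched as a small Python function that returns the next-state p.m.f. for each joint action. The encoding is an assumption of this sketch: rewards are represented by their indices, the string "t" marks a terminated forwarder, and the joint p.m.f. is a square nested list. The function name and example values are hypothetical.

```python
T_STATE = "t"  # marker for a forwarder that has already terminated


def next_state_pmf(action, p_joint, nu_1):
    """Next-state p.m.f. from a state (r_i, r_j) under joint action (a1, a2).

    States are encoded by reward indices; the p.m.f. does not depend on (i, j)
    because successive reward pairs are i.i.d.
    """
    n = len(p_joint)
    p1 = [sum(p_joint[i][j] for j in range(n)) for i in range(n)]
    p2 = [sum(p_joint[i][j] for i in range(n)) for j in range(n)]
    nu_2 = 1.0 - nu_1
    pmf = {}
    if action == ("c", "c"):     # nobody takes the relay, both see the next pair
        for i in range(n):
            for j in range(n):
                pmf[(i, j)] = p_joint[i][j]
    elif action == ("c", "s"):   # forwarder 2 takes the relay, 1 continues alone
        for i in range(n):
            pmf[(i, T_STATE)] = p1[i]
    elif action == ("s", "c"):   # forwarder 1 takes the relay, 2 continues alone
        for j in range(n):
            pmf[(T_STATE, j)] = p2[j]
    elif action == ("s", "s"):   # contention: 1 wins w.p. nu_1, 2 wins w.p. nu_2
        for i in range(n):
            pmf[(i, T_STATE)] = nu_2 * p1[i]
        for j in range(n):
            pmf[(T_STATE, j)] = nu_1 * p2[j]
    else:
        raise ValueError("unknown joint action")
    return pmf
```

A quick sanity check is that the returned probabilities sum to one for every joint action.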

One-Step Costs: The one-step costs should be such that, for
any policy pair $(\pi_1, \pi_2)$, the sum of all one-step costs incurred
by $\mathcal{F}_p$ ($p = 1,2$) equals the total cost in (2). With
this in mind, in Table 1 we write the pair of one-step costs,
$(g_1(x,a), g_2(x,a))$, incurred by $\mathcal{F}_1$ and $\mathcal{F}_2$ for different joint-actions,
$a = (a_1, a_2)$, when the current state is $x = (r_i, r_j)$.

TABLE 1
One-step costs when $x = (r_i, r_j)$.

$a = (a_1, a_2)$    $(g_1(x,a), g_2(x,a))$
$(c,c)$    $(\tau, \tau)$
$(c,s)$    $(\tau, -\eta_2 r_j)$
$(s,c)$    $(-\eta_1 r_i, \tau)$
$(s,s)$    $(-\eta_1 r_i, \tau)$ w.p. $\nu_1$; $(\tau, -\eta_2 r_j)$ w.p. $\nu_2$


From Table 1 we see that if the joint action is $(c,c)$ then
both forwarders continue, each incurring a cost of $\tau$ which is the
average time until the next relay arrives. When one of the
forwarders, say $\mathcal{F}_2$, chooses to stop (i.e., the joint action is
$(c,s)$) then $\mathcal{F}_2$, forwarding its packet to the chosen relay,
incurs a terminating cost of $-\eta_2 r_j$ while $\mathcal{F}_1$ simply continues,
incurring an average waiting delay of $\tau$. The case where
the joint action is $(s,c)$ is analogous. Finally, if both forwarders
compete (i.e., the case $(s,s)$), then with probability $\nu_p$, $\mathcal{F}_p$
gets the relay incurring the terminating cost while the other
forwarder has to continue.


TABLE 2
One-step costs when $x = (r_i, t)$.

$a_1$    $(g_1(x,a), g_2(x,a))$
C    $(\tau, 0)$
S    $(-\eta_1 r_i, 0)$

TABLE 3
One-step costs when $x = (t, r_j)$.

$a_2$    $(g_1(x,a), g_2(x,a))$
C    $(0, \tau)$
S    $(0, -\eta_2 r_j)$

When the state is of the form $(r_i, t)$ the cost incurred
by $\mathcal{F}_2$ is 0 for any joint-action $a$, and further the one-step
cost incurred by $\mathcal{F}_1$ depends only on the action $a_1$ of $\mathcal{F}_1$.
An analogous situation holds for $\mathcal{F}_2$ when the state is $(t, r_j)$.
These costs are given in Tables 2 and 3, respectively. Finally,
the cost incurred by both the forwarders once the termination
state $(t,t)$ is reached is 0.
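The one-step cost tables can be collected into a single Python function, sketched below. As an assumption of this sketch, the random cost under joint action (s, s) is replaced by its expectation over the contention outcome (forwarder 1 wins with probability $\nu_1$), in line with the "(expected) one-step-cost" wording of the game definition. States are encoded by reward indices with the string "t" marking a terminated forwarder; the function name and this encoding are hypothetical, and the example is meant for finite reward values.

```python
T_STATE = "t"  # marker for a forwarder that has already terminated


def expected_one_step_costs(state, action, rewards, tau, eta_1, eta_2, nu_1):
    """Expected one-step costs (g1, g2) for a state and a joint action (a1, a2)."""
    x1, x2 = state
    a1, a2 = action
    if x1 == T_STATE and x2 == T_STATE:    # both terminated: no further cost
        return 0.0, 0.0
    # Cost of stopping is minus the (scaled) reward of the current relay.
    stop_1 = None if x1 == T_STATE else -eta_1 * rewards[x1]
    stop_2 = None if x2 == T_STATE else -eta_2 * rewards[x2]
    if x2 == T_STATE:                      # Table 2: only forwarder 1 is present
        return (stop_1 if a1 == "s" else tau), 0.0
    if x1 == T_STATE:                      # Table 3: only forwarder 2 is present
        return 0.0, (stop_2 if a2 == "s" else tau)
    if a1 == "s" and a2 == "s":            # contention, averaged over the winner
        nu_2 = 1.0 - nu_1
        return nu_1 * stop_1 + nu_2 * tau, nu_1 * tau + nu_2 * stop_2
    # Remaining rows of Table 1: each forwarder pays its own stop or wait cost.
    return (stop_1 if a1 == "s" else tau), (stop_2 if a2 == "s" else tau)
```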

Now, given a policy pair $(\pi_1, \pi_2)$ (recall Definition 1) and
an initial state $x \in \mathcal{X}$, let $\{X_k : k \ge 1\}$ denote the sequence of
(random) states traversed by the system, and let $\{(A_{1,k}, A_{2,k}) :
k \ge 1\}$ denote the sequence of joint-actions. The total cost in
(2) can now be expressed as the sum of all the one-step costs
as follows:

$J^{(p)}_{\pi_1,\pi_2}(x) = \sum_{k=1}^{\infty} \mathbb{E}^x_{\pi_1,\pi_2}\left[ g_p\big(X_k, (A_{1,k}, A_{2,k})\big) \right]$. (9)

B. Characterization of NEPPs 

States of the form $(r_i, t)$, $(t, r_j)$

Once the system enters a state of the form $(r_i, t)$, since
only $\mathcal{F}_1$ is present in the system, we essentially have an
MDP problem where $\mathcal{F}_1$ is attempting to optimize its cost.
Formally, if $(\pi_1^*, \pi_2^*)$ is an NEPP then it can be argued² that
$J^{(1)}_{\pi_1^*,\pi_2^*}(r_i, t)$ is the optimal cost to $\mathcal{F}_1$ with $\pi_1^*$ being an
optimal policy; the cost incurred by $\mathcal{F}_2$ is 0 and $\pi_2^*$ can be
arbitrary, but for simplicity we fix $\pi_2^*(r_i, t) = $ S for all $i \in [n]$.
Hence $J^{(1)}_{\pi_1^*,\pi_2^*}(\cdot, t)$ satisfies the following Bellman optimality
equation:

$J^{(1)}_{\pi_1^*,\pi_2^*}(r_i, t) = \min\left\{ -\eta_1 r_i,\ D^{(1)} \right\}$ (10)

where

$D^{(1)} = \tau + \sum_{i'=1}^{n} p^{(1)}_{i'} J^{(1)}_{\pi_1^*,\pi_2^*}(r_{i'}, t)$ (11)

is the expected cost of continuing alone in the system ($\tau$ is
the one-step cost and the remaining term is the future cost-to-go).
$-\eta_1 r_i$ in the min-expression above is the cost of stopping.

Thus, denoting $-\frac{D^{(1)}}{\eta_1}$ by $\alpha^{(1)}$, whenever the state is of the form
$(r_i, t)$ an optimal policy is as follows:

$\pi_1^*(r_i, t) = \begin{cases} \text{S} & \text{if } r_i \ge \alpha^{(1)} \\ \text{C} & \text{otherwise.} \end{cases}$ (12)


Remark: As mentioned earlier (recall the discussion in related work), the solution to the basic relay selection problem, comprising a single forwarder (say only $\mathscr{F}_1$) and a finite number of relays $N$, is characterized in terms of a single threshold $\alpha$. Furthermore, from our earlier work [?, Section 6] we know that $\alpha$ is the unique fixed point of

$$\beta^{(1)}(x) = E\left[\max\{x, R_1\}\right] - \frac{\tau}{\eta_1}, \tag{13}$$

where the expectation in the above expression is w.r.t. the p.m.f. $p^{(1)}$ of $R_1$. Here we will show that $\alpha^{(1)}$ is the fixed point of $\beta^{(1)}$, formalizing our earlier claim that the competitive model with only one forwarder and the infinite horizon basic model are equivalent. Although this result can be deduced by showing the equivalence between our competitive model with a single forwarder and the infinite horizon version of the asset selling problem, we prove it here for completeness.

$^{2}$ Using the definition of an NEPP and the fact that the costs and the state transitions do not depend on the policy of the other forwarder anymore.

Lemma 1: $\alpha^{(1)}$ is the unique fixed point of $\beta^{(1)}(x)$ ($x \in (-\infty, r_n]$) in (13).

Proof: We will first show that $\beta^{(1)}$ is a contraction mapping. Then, from the Banach fixed point theorem [38] it follows that there exists a unique fixed point $\alpha^*$ of $\beta^{(1)}$. Next, through an induction argument we will prove that $J^{(1)}_{\pi_1^*,\pi_2^*}(r_i, t) = \min\{-\eta_1 r_i, -\eta_1 \alpha^*\}$. Finally, substituting for $J^{(1)}_{\pi_1^*,\pi_2^*}(r_i, t)$ in $\alpha^{(1)} = \frac{D^{(1)}}{-\eta_1}$ (recall $D^{(1)}$ from (11)) and simplifying, we obtain the desired result. Details of the proof are available in Appendix A. ■
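As a numerical illustration of Lemma 1, the threshold can be approximated by repeatedly applying the map in (13), starting from the largest reward. The following is a minimal Python sketch under stated assumptions, not the authors' implementation: the function names `beta` and `solve_alpha`, and the three-reward instance (uniform p.m.f., $\tau = 0.5$, $\eta_1 = 1$), are hypothetical and chosen only for illustration.

```python
def beta(x, rewards, pmf, tau, eta):
    """The map in (13): beta(x) = E[max(x, R)] - tau / eta."""
    return sum(p * max(x, r) for r, p in zip(rewards, pmf)) - tau / eta


def solve_alpha(rewards, pmf, tau, eta, tol=1e-12, max_iter=100000):
    """Approximate the fixed point of beta by successive substitution.

    Starting at the largest reward is an assumption of this sketch; from
    there the iterates decrease towards the fixed point.
    """
    x = max(rewards)
    for _ in range(max_iter):
        nxt = beta(x, rewards, pmf, tau, eta)
        if abs(nxt - x) < tol:
            return nxt
        x = nxt
    raise RuntimeError("fixed-point iteration did not converge")


# Hypothetical instance: three equally likely rewards.
rewards = [1.0, 2.0, 3.0]
pmf = [1.0 / 3.0] * 3
tau, eta1 = 0.5, 1.0

alpha1 = solve_alpha(rewards, pmf, tau, eta1)
D1 = -eta1 * alpha1                                 # cost of continuing alone
stop = [r >= alpha1 for r in rewards]               # policy (12): stop iff r_i >= alpha
```

For this instance the fixed point can be checked by hand: for $x \in [1, 2)$ the map reduces to $(x + 5)/3 - 0.5$, whose fixed point is $1.75$, so $D^{(1)} = -1.75$ and only the rewards 2 and 3 trigger stopping.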

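Similarly, when the state is of the form $(t, r_j)$ (i.e., $\mathscr{F}_1$ has already terminated), if $(\pi_1^*, \pi_2^*)$ is an NEPP then $J^{(1)}_{\pi_1^*,\pi_2^*}(t, r_j) = 0$ and $\pi_1^*(t, r_j) = S$, while $J^{(2)}_{\pi_1^*,\pi_2^*}(t, r_j)$ satisfies

$$J^{(2)}_{\pi_1^*,\pi_2^*}(t, r_j) = \min\left\{ -\eta_2 r_j,\; D^{(2)} \right\} \tag{14}$$

where $D^{(2)} = \tau + \sum_{j'} p^{(2)}_{j'}\, J^{(2)}_{\pi_1^*,\pi_2^*}(t, r_{j'})$. Further, $\alpha^{(2)} = \frac{D^{(2)}}{-\eta_2}$ is the unique fixed point of $\beta^{(2)}(x) = E\left[\max\{x, R_2\}\right] - \frac{\tau}{\eta_2}$, where now the expectation is w.r.t. the p.m.f. $p^{(2)}$ of $R_2$. Finally, an optimal policy is such that

$$\pi_2^*(t, r_j) = \begin{cases} S & \text{if } r_j \ge \alpha^{(2)} \\ C & \text{otherwise.} \end{cases} \tag{15}$$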


States of the form $(r_i, r_j)$

This is the more interesting case where both forwarders are present in the system and are competing to choose a relay. When the state is of the form $(r_i, r_j)$, if $\mathscr{F}_1$ decides to continue while $\mathscr{F}_2$ chooses to stop (i.e., the joint-action is (c, s)), then $\mathscr{F}_2$ terminates by incurring a cost of $-\eta_2 r_j$ so that the next state is of the form $(r_{i'}, t)$. Hence the expected total cost incurred by $\mathscr{F}_1$, if it uses the policy in (12) from the next stage onwards, is $D^{(1)}$ (recall (11)). Similarly, if the joint-action is (s, c) then $\mathscr{F}_1$ terminates incurring a cost of $-\eta_1 r_i$, and $\mathscr{F}_2$ incurs a cost of $D^{(2)}$ if it uses the policy in (15) from the next stage onwards.

If both forwarders decide to stop (joint-action is (s, s)) then with probability $\nu_1$, $\mathscr{F}_1$ gets the relay, in which case $\mathscr{F}_2$ continues alone, and with probability $\nu_2$ it is vice versa. Thus, the expected cost incurred by $\mathscr{F}_1$ is

$$E^{(1)}(r_i) = \nu_1(-\eta_1 r_i) + \nu_2 D^{(1)}, \tag{16}$$

and that by $\mathscr{F}_2$ is

$$E^{(2)}(r_j) = \nu_1 D^{(2)} + \nu_2(-\eta_2 r_j). \tag{17}$$


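Finally, if both forwarders choose to continue (i.e., if the joint-action is (c, c)) then the next state is again of the form $(r_{i'}, r_{j'})$. Thus if $(\pi_1, \pi_2)$ is the policy pair used from the next stage onwards then the expected costs incurred by $\mathscr{F}_1$ and $\mathscr{F}_2$ are, respectively,

$$C^{(1)}_{\pi_1,\pi_2} = \tau + \sum_{i',j'} p_{i',j'}\, J^{(1)}_{\pi_1,\pi_2}(r_{i'}, r_{j'}) \tag{18}$$

$$C^{(2)}_{\pi_1,\pi_2} = \tau + \sum_{i',j'} p_{i',j'}\, J^{(2)}_{\pi_1,\pi_2}(r_{i'}, r_{j'}). \tag{19}$$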


We are now ready to state the following main theorem (adapted from [29]), which relates the “NEPPs of the stochastic game” with the “Nash equilibrium strategies of a certain static game” played at a stage. The various cost terms described above are used to construct this static game. We state the theorem below with the understanding that for states of the form $x = (r_i, t)$ and $x = (t, r_j)$, $\pi_1^*(x)$ and $\pi_2^*(x)$ are as in (12) and (15), respectively.

Theorem 1: Given a policy pair $(\pi_1^*, \pi_2^*)$, for each state $x = (r_i, r_j)$ construct the static game given in Table 4.

         c                                                              s
  c      $C^{(1)}_{\pi_1^*,\pi_2^*},\ C^{(2)}_{\pi_1^*,\pi_2^*}$        $D^{(1)},\ -\eta_2 r_j$
  s      $-\eta_1 r_i,\ D^{(2)}$                                        $E^{(1)}(r_i),\ E^{(2)}(r_j)$

TABLE 4

Static stage game (rows: action of $\mathscr{F}_1$; columns: action of $\mathscr{F}_2$).

Then the following statements are equivalent:

a) $(\pi_1^*, \pi_2^*)$ is an NEPP.

b) For each $x = (r_i, r_j)$, $(\pi_1^*(x), \pi_2^*(x))$ is a Nash equilibrium (NE) strategy for the game in Table 4. Further, the expected payoff pair at this NE strategy is $\big(J^{(1)}_{\pi_1^*,\pi_2^*}(x), J^{(2)}_{\pi_1^*,\pi_2^*}(x)\big)$.

Proof: Although the proof of this theorem is along the lines of the proof of Theorem 4.6.5 in [29], some additional effort is required since the proof in [29] is for the case where the costs are discounted, while ours is a total cost undiscounted stochastic game. Further, the presence of a cost-free absorption state for each player renders our problem transient, by which we mean that, when the policy of one player is fixed, the problem of obtaining the optimal policy for the other player is a stopping problem [39]. Using this property we have modified the proof of [29, Theorem 4.6.5] appropriately so that the result holds for our case. For details, see Appendix B. ■
Discussion: In this discussion, for simplicity, we will omit $(\pi_1^*, \pi_2^*)$ from all the associated notation. Now, Theorem 1 can be seen as an extension of the Bellman optimality equation in (10), where to obtain $J^{(1)}(r_i, t)$ we require the cost term $D^{(1)}$ in (11), which in turn depends on the function $J^{(1)}(\cdot, t)$. This essentially suggests that $J^{(1)}(\cdot, t)$ is the fixed point of the Bellman equation in (10). Similarly, here we see that, given the cost pair $(C^{(1)}, C^{(2)})$, one can obtain $(J^{(1)}(x), J^{(2)}(x))$ by solving the game in Table 4. However, computing $(C^{(1)}, C^{(2)})$ itself will require the function pair $(J^{(1)}(\cdot), J^{(2)}(\cdot))$, thus suggesting that $(J^{(1)}(\cdot), J^{(2)}(\cdot))$ has to be a fixed point of a mapping which involves computing the payoff pair of the static game in Table 4. Furthermore, analogous to computing the minimum in (10) to obtain the optimal action, here, by computing the NE strategies of the game in Table 4 we obtain the solution to our stochastic game.

Fig. 2. Illustration of the various regions along with the NE strategies corresponding to these regions.

Assuming that the cost pair $(C^{(1)}_{\pi_1^*,\pi_2^*}, C^{(2)}_{\pi_1^*,\pi_2^*})$ is given to us, we now proceed to obtain all the NE strategies of the game in Table 4. We will first require the following key lemma.

Lemma 2: For an NEPP $(\pi_1^*, \pi_2^*)$, the various costs are ordered as follows:

$$D^{(1)} \le C^{(1)}_{\pi_1^*,\pi_2^*} \quad \text{and} \quad D^{(2)} \le C^{(2)}_{\pi_1^*,\pi_2^*}. \tag{20}$$

Proof: See Appendix C. ■

Discussion: The above lemma becomes intuitive once we recall that $D^{(1)}$ is the optimal cost incurred by $\mathscr{F}_1$ if it is alone in the system, while $C^{(1)}_{\pi_1^*,\pi_2^*}$ is the cost incurred if $\mathscr{F}_2$ is also present, and competing with $\mathscr{F}_1$ in choosing a relay. One would expect $\mathscr{F}_1$ to incur a lower cost without the competing forwarder.

For notational simplicity, from here on, we will denote the costs $C^{(1)}_{\pi_1^*,\pi_2^*}$ and $C^{(2)}_{\pi_1^*,\pi_2^*}$ as simply $C^{(1)}$ and $C^{(2)}$. We will write $C$ for the pair $(C^{(1)}, C^{(2)})$. An important consequence of Lemma 2 is that, while solving the game in Table 4, it is sufficient to only consider cost pairs $(C^{(1)}, C^{(2)})$ which are ordered as in the lemma; the other cases (e.g., $D^{(1)} > C^{(1)}$ or $D^{(2)} > C^{(2)}$) cannot occur, and hence need not be considered.

Further, for convenience let us denote the thresholds $\frac{C^{(1)}}{-\eta_1}$ and $\frac{C^{(2)}}{-\eta_2}$ by $\zeta^{(1)}$ and $\zeta^{(2)}$, respectively (recall that we already have $\alpha^{(1)} = \frac{D^{(1)}}{-\eta_1}$ and $\alpha^{(2)} = \frac{D^{(2)}}{-\eta_2}$). Then, the solution (i.e., the NE strategies) to the game in Table 4, for each $(r_i, r_j)$ pair, is as depicted in Fig. 2.

We see that the thresholds $(\zeta^{(1)}, \alpha^{(1)})$ and $(\zeta^{(2)}, \alpha^{(2)})$ partition the reward pair set, $\{(r_i, r_j) : i, j \in [n]\}$, into 5 regions $(\mathcal{R}_1, \cdots, \mathcal{R}_5)^{3}$ such that the NE strategy (strategies) corresponding to each region are different. For instance, for any $(r_i, r_j) \in \mathcal{R}_1$, (c, c) (i.e., both forwarders continue) is the only NE strategy, while within $\mathcal{R}_2$, (s, c) is the NE strategy, and so on. All regions contain a unique pure NE strategy except for $\mathcal{R}_4$, where (s, c), (c, s), and the mixed strategy $(\Gamma_1, \Gamma_2)$ ($\Gamma_\rho$ is the probability with which $\mathscr{F}_\rho$ chooses S) are all NE strategies. The expression for $\Gamma_1$ is

$$\Gamma_1 = \frac{-\eta_2 r_j - C^{(2)}}{\left(-\eta_2 r_j - C^{(2)}\right) - \left[E^{(2)}(r_j) - D^{(2)}\right]}. \tag{23}$$

Analogously one can write the expression for $\Gamma_2$. For details on how to solve the game in Table 4 to obtain the various regions, see Appendix D. Finally, we summarize the observations made thus far in the form of the following theorem.

$^{3}$ These regions depend on the cost pair $C$; for simplicity we neglect $C$ in their notation. However, we will invoke this dependency when required.

Theorem 2: The NE strategies of the game in Table 4 are completely characterized by the threshold pairs $(\zeta^{(\rho)}, \alpha^{(\rho)})$, $\rho = 1, 2$, as follows (recall Fig. 2 for illustration):

• If $r_i$ is less than $\zeta^{(1)}$, then the NE strategy recommends C for $\mathscr{F}_1$ irrespective of the reward value $r_j$ of $\mathscr{F}_2$.

• On the other hand, if $r_i$ is more than $\alpha^{(1)}$, then the NE strategy recommends action S for $\mathscr{F}_1$ irrespective of the value of $r_j$ (note that this is exactly the action $\mathscr{F}_1$ would choose if it was alone in the system; see the discussion following (12)).

• Finally, the presence of the competing forwarder $\mathscr{F}_2$ is felt by $\mathscr{F}_1$ only when its reward value $r_i$ is between $\zeta^{(1)}$ and $\alpha^{(1)}$, in which case the NE strategies are: (s, c) if $r_j < \zeta^{(2)}$; (s, c), (c, s) and $(\Gamma_1, \Gamma_2)$ if $\zeta^{(2)} < r_j < \alpha^{(2)}$; and (c, s) if $r_j > \alpha^{(2)}$.

Analogous results hold for $\mathscr{F}_2$.
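The region structure in Theorem 2 can be checked numerically by building the cost bimatrix of Table 4 and enumerating its pure NE by brute force. The following Python sketch does this for a hypothetical parameter set (the values of $C$, $D$, $\eta$, $\nu$ and the function names `stage_game`, `pure_ne` and `gamma_1` are assumptions made for illustration, not taken from the paper); both players minimize cost, and `gamma_1` evaluates (23).

```python
ACTIONS = ("c", "s")


def stage_game(ri, rj, C, D, eta, nu):
    """Cost pairs of Table 4, keyed by the joint action (a1, a2)."""
    E1 = nu[0] * (-eta[0] * ri) + nu[1] * D[0]      # (16)
    E2 = nu[0] * D[1] + nu[1] * (-eta[1] * rj)      # (17)
    return {
        ("c", "c"): (C[0], C[1]),
        ("c", "s"): (D[0], -eta[1] * rj),
        ("s", "c"): (-eta[0] * ri, D[1]),
        ("s", "s"): (E1, E2),
    }


def pure_ne(game):
    """All joint actions from which no player can lower its own cost."""
    result = []
    for a1 in ACTIONS:
        for a2 in ACTIONS:
            c1, c2 = game[(a1, a2)]
            ok1 = all(c1 <= game[(b, a2)][0] for b in ACTIONS)
            ok2 = all(c2 <= game[(a1, b)][1] for b in ACTIONS)
            if ok1 and ok2:
                result.append((a1, a2))
    return result


def gamma_1(rj, C2, D2, eta2, nu):
    """Probability (23) with which forwarder 1 stops in the mixed NE."""
    E2 = nu[0] * D2 + nu[1] * (-eta2 * rj)
    num = -eta2 * rj - C2
    return num / (num - (E2 - D2))


# Hypothetical values: alpha = 1.75 and zeta = 1.5 for both forwarders.
C, D = (-1.5, -1.5), (-1.75, -1.75)
eta, nu = (1.0, 1.0), (0.5, 0.5)

low = pure_ne(stage_game(1.0, 1.0, C, D, eta, nu))    # both rewards below zeta
mid = pure_ne(stage_game(1.6, 1.6, C, D, eta, nu))    # both between zeta and alpha
high = pure_ne(stage_game(2.0, 2.0, C, D, eta, nu))   # both above alpha
g1 = gamma_1(1.6, C[1], D[1], eta[1], nu)
```

With these numbers the enumeration returns only (c, c) for the low pair, both (c, s) and (s, c) for the middle pair, and only (s, s) for the high pair, in line with Theorem 2. The value `g1` makes the second forwarder indifferent between its two actions at the middle pair, which is the defining property of the mixed NE.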


C. Constructing NEPPs from NE strategies 

The cost terms $D^{(1)}$ and $D^{(2)}$ can be easily computed by solving the optimality equations (10) and (14), respectively. Alternatively, we can first compute the fixed points of $\beta^{(1)}(\cdot)$ and $\beta^{(2)}(\cdot)$ to obtain $\alpha^{(1)}$ and $\alpha^{(2)}$, respectively (recall Lemma 1). Then, $D^{(1)} = -\eta_1 \alpha^{(1)}$ and $D^{(2)} = -\eta_2 \alpha^{(2)}$.

The costs $C^{(1)}$ and $C^{(2)}$ (in (18) and (19)) depend on the particular NEPP used, i.e., they require the cost terms $J^{(1)}_{\pi_1^*,\pi_2^*}(r_i, r_j)$ and $J^{(2)}_{\pi_1^*,\pi_2^*}(r_i, r_j)$ for all $(r_i, r_j)$ to compute them. Conversely, part (b) of Theorem 1 suggests that $J^{(1)}_{\pi_1^*,\pi_2^*}(r_i, r_j)$ (respectively, $J^{(2)}_{\pi_1^*,\pi_2^*}(r_i, r_j)$) can be obtained by computing the expected cost incurred by $\mathscr{F}_1$ (respectively, $\mathscr{F}_2$) at an NE strategy of the game in Table 4, which in turn requires the terms $C^{(1)}$ and $C^{(2)}$. Hence, to obtain $(C^{(1)}, C^{(2)})$ we proceed by expressing $(C^{(1)}, C^{(2)})$ as the fixed point of a mapping $\mathcal{T}$ which can then be used to compute these costs.

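Suppose $(\pi_1^*, \pi_2^*)$ is an NEPP such that for all $x = (r_i, r_j) \in \mathcal{R}_4(C)$ the NE strategy $(\pi_1^*(x), \pi_2^*(x))$ is (s, c). Then using part (b) of Theorem 1 we can write

$$J^{(1)}_{\pi_1^*,\pi_2^*}(r_i, r_j) = \begin{cases} C^{(1)} & \text{if } (r_i, r_j) \in \mathcal{R}_1(C) \\ -\eta_1 r_i & \text{if } (r_i, r_j) \in \mathcal{R}_2(C) \\ D^{(1)} & \text{if } (r_i, r_j) \in \mathcal{R}_3(C) \\ -\eta_1 r_i & \text{if } (r_i, r_j) \in \mathcal{R}_4(C) \\ E^{(1)}(r_i) & \text{if } (r_i, r_j) \in \mathcal{R}_5(C). \end{cases} \tag{24}$$

Using the above expression in (18), $C^{(1)}$ can be written as $C^{(1)} = \mathcal{T}_1(C)$, where the function $\mathcal{T}_1(C)$ is as in (21) (where for simplicity, we have used $(i, j)$ instead of $(r_i, r_j)$). Similarly, $C^{(2)}$ can be expressed as $C^{(2)} = \mathcal{T}_2(C)$; see (22). Thus, $C$ is a fixed point of the mapping $\mathcal{T}(C) := (\mathcal{T}_1(C), \mathcal{T}_2(C))$.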

We do not have results showing that $\mathcal{T}$ indeed has a fixed point or, equivalently, that an NEPP $(\pi_1^*, \pi_2^*)$ always exists,$^{4}$ although such a result holds for the discounted stochastic game [29, Theorem 4.6.4] (recall that ours is a transient stochastic game). However, in our numerical results section (Section VIII) we were able to numerically obtain $C$ by iteration. Thus, we begin with an initial $C(0)$ such that $C^{(1)}(0) \le D^{(1)}$ and $C^{(2)}(0) \le D^{(2)}$, and inductively iterate to obtain $C(k) = \mathcal{T}(C(k-1))$ until convergence is achieved. Finally, given a fixed point $C$, we obtain the corresponding NEPP $(\pi_1^*, \pi_2^*)$ by constructing the various regions as in Fig. 2.
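The iteration $C(k) = \mathcal{T}(C(k-1))$ can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the authors' code: the names `stage_costs`, `select_ne`, `T_map` and `iterate_T`, and the small instance (three rewards with a uniform joint p.m.f., $\tau = 0.5$, $\eta_1 = \eta_2 = 1$, $\nu_1 = \nu_2 = 0.5$, and $D^{(1)} = D^{(2)} = -1.75$) are hypothetical. Instead of constructing the five regions explicitly, the sketch finds the pure NE of Table 4 at each reward pair by enumeration and picks (s, c) whenever several exist, which mirrors the restriction used for $\mathcal{R}_4$. As noted in the text, convergence is not guaranteed in general.

```python
ACTIONS = ("c", "s")


def stage_costs(ri, rj, C, D, eta, nu):
    """Cost pairs of Table 4, keyed by the joint action (a1, a2)."""
    E1 = nu[0] * (-eta[0] * ri) + nu[1] * D[0]      # (16)
    E2 = nu[0] * D[1] + nu[1] * (-eta[1] * rj)      # (17)
    return {
        ("c", "c"): (C[0], C[1]),
        ("c", "s"): (D[0], -eta[1] * rj),
        ("s", "c"): (-eta[0] * ri, D[1]),
        ("s", "s"): (E1, E2),
    }


def select_ne(game, prefer=("s", "c")):
    """Pick a pure NE of the cost game, preferring (s, c) if it is one."""
    ne = [
        (a1, a2)
        for a1 in ACTIONS
        for a2 in ACTIONS
        if all(game[(a1, a2)][0] <= game[(b, a2)][0] for b in ACTIONS)
        and all(game[(a1, a2)][1] <= game[(a1, b)][1] for b in ACTIONS)
    ]
    if not ne:
        raise ValueError("no pure NE found; is C ordered as in Lemma 2?")
    return prefer if prefer in ne else ne[0]


def T_map(C, rewards, joint, D, eta, nu, tau):
    """One application of the mapping in (21)-(22)."""
    t1 = t2 = tau
    for i, ri in enumerate(rewards):
        for j, rj in enumerate(rewards):
            game = stage_costs(ri, rj, C, D, eta, nu)
            c1, c2 = game[select_ne(game)]
            t1 += joint[i][j] * c1
            t2 += joint[i][j] * c2
    return (t1, t2)


def iterate_T(C0, rewards, joint, D, eta, nu, tau, tol=1e-12, max_iter=10000):
    """Iterate C(k) = T(C(k-1)) until successive iterates agree."""
    C = C0
    for _ in range(max_iter):
        nxt = T_map(C, rewards, joint, D, eta, nu, tau)
        if max(abs(nxt[0] - C[0]), abs(nxt[1] - C[1])) < tol:
            return nxt
        C = nxt
    raise RuntimeError("iteration did not converge")


# Hypothetical instance; D corresponds to alpha = 1.75 for both forwarders.
rewards = [1.0, 2.0, 3.0]
joint = [[1.0 / 9.0] * 3 for _ in range(3)]
D, eta, nu, tau = (-1.75, -1.75), (1.0, 1.0), (0.5, 0.5), 0.5

C_star = iterate_T(D, rewards, joint, D, eta, nu, tau)
```

In this instance no reward falls strictly between the two thresholds, so the tie-breaking rule is never triggered and the fixed point can be verified by hand: each component satisfies $C = 0.5 + (C - 17)/9$, i.e., $C = -1.5625$, which respects the ordering $D \le C$ of Lemma 2.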
Other NEPPs: Recall that to obtain $(C^{(1)}, C^{(2)})$ we had restricted $(\pi_1^*, \pi_2^*)$ to use the NE strategy (s, c) whenever $(r_i, r_j) \in \mathcal{R}_4(C)$. We can similarly obtain NEPPs $(\pi_1^{\circ}, \pi_2^{\circ})$ and $(\pi_1', \pi_2')$ (whose corresponding cost pairs are $C^{\circ}$ and $C'$) by restricting to the NE strategies (c, s) and $(\Gamma_1, \Gamma_2)$ whenever $(r_i, r_j) \in \mathcal{R}_4(C^{\circ})$ and $(r_i, r_j) \in \mathcal{R}_4(C')$, respectively. In Section VIII we will numerically compare the performance of all these NEPPs.


VI. Partially Observable Case 

Let us first formally introduce a finite location set $\mathcal{L}$. Let $L_k \in \mathcal{L}$ denote the location of the $k$-th relay. The locations $\{L_k : k \ge 1\}$ are i.i.d. with their common p.m.f. being $(q_\ell : \ell \in \mathcal{L})$. Recall that for the PO case we assume that only $R_{\rho,k}$ is revealed to $\mathscr{F}_\rho$ ($\rho = 1, 2$). In addition, we will assume that $L_k$ is revealed to both the forwarders.

Recalling the geographical forwarding example from Section III, the PO case corresponds to the scenario where, in addition to $L_k$, the gains $G_{\rho,k}$ are required to compute $R_{\rho,k}$ (i.e., if $a < 1$ in the reward expression). Hence, $\mathscr{F}_1$, not knowing $G_{2,k}$, cannot compute $R_{2,k}$. However, knowing the channel gain distribution (recall that the gains are identically distributed) it is possible for $\mathscr{F}_1$ to compute the probability distribution of $R_{2,k}$ given $L_k$. Similarly, $\mathscr{F}_2$ can compute the distribution of $R_{1,k}$ given $L_k$. Further, since the gains $(G_{1,k}, G_{2,k})$ are independent, it follows that $R_{1,k}$ and $R_{2,k}$ are independent given $L_k$ (but unconditionally they may be dependent).

$^{4}$ This equivalence can be easily shown by first using $(\pi_1^*, \pi_2^*)$ in part (a) of Theorem 1 to conclude that part (b) holds, and then simply from the definition of $\mathcal{T}$ it will follow that it has a fixed point. For the other direction, given a fixed point $C$ of $\mathcal{T}$, one can easily obtain the corresponding NEPP $(\pi_1^*, \pi_2^*)$ by constructing the various regions as shown in Fig. 2.

$$\mathcal{T}_1(C) = \tau + \sum_{(i,j) \in \mathcal{R}_1(C)} p_{i,j}\, C^{(1)} + \sum_{(i,j) \in \mathcal{R}_2(C) \cup \mathcal{R}_4(C)} p_{i,j}(-\eta_1 r_i) + \sum_{(i,j) \in \mathcal{R}_3(C)} p_{i,j}\, D^{(1)} + \sum_{(i,j) \in \mathcal{R}_5(C)} p_{i,j}\, E^{(1)}(r_i) \tag{21}$$

$$\mathcal{T}_2(C) = \tau + \sum_{(i,j) \in \mathcal{R}_1(C)} p_{i,j}\, C^{(2)} + \sum_{(i,j) \in \mathcal{R}_2(C) \cup \mathcal{R}_4(C)} p_{i,j}\, D^{(2)} + \sum_{(i,j) \in \mathcal{R}_3(C)} p_{i,j}(-\eta_2 r_j) + \sum_{(i,j) \in \mathcal{R}_5(C)} p_{i,j}\, E^{(2)}(r_j) \tag{22}$$

Formally, given that $L_k = \ell$, we will assume the following independence condition:

$$p_{R_1,R_2|L_k}(r_i, r_j \mid \ell) = p_{R_1|L_k}(r_i \mid \ell)\, p_{R_2|L_k}(r_j \mid \ell). \tag{25}$$

For simplicity, we will denote the conditional p.m.f.s $p_{R_1|L_k}(r_i \mid \ell)$ and $p_{R_2|L_k}(r_j \mid \ell)$, $i, j \in [n]$, by $p^{(1)}_{i|\ell}$ and $p^{(2)}_{j|\ell}$, respectively.
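To see that conditional independence given the location does not imply unconditional independence, consider the following small numerical sketch in Python. The two locations, their probabilities and the conditional p.m.f.s are hypothetical values chosen for illustration; `joint_pmf` is an assumed helper name.

```python
# Hypothetical location p.m.f. q_l and conditional reward p.m.f.s over two
# reward levels: a "near" relay tends to give both forwarders the first
# level, a "far" relay the second.
q = {"near": 0.5, "far": 0.5}
p1 = {"near": [0.9, 0.1], "far": [0.1, 0.9]}   # p^(1)_{i|l}
p2 = {"near": [0.9, 0.1], "far": [0.1, 0.9]}   # p^(2)_{j|l}


def joint_pmf(q, p1, p2):
    """Unconditional joint p.m.f. p_{i,j} = sum_l q_l p1[l][i] p2[l][j],
    which uses the independence condition (25) inside each location."""
    n = len(next(iter(p1.values())))
    return [
        [sum(q[l] * p1[l][i] * p2[l][j] for l in q) for j in range(n)]
        for i in range(n)
    ]


joint = joint_pmf(q, p1, p2)
marg1 = [sum(row) for row in joint]
marg2 = [sum(joint[i][j] for i in range(len(joint))) for j in range(len(joint))]
```

Here `joint[0][0]` is 0.41 while the product of the marginals `marg1[0] * marg2[0]` is 0.25, so the two rewards are dependent once the location is averaged out, even though (25) holds for each location.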

Remark: Usually for a model with partial observations the belief that $\mathscr{F}_1$ will maintain about $R_2$ will simply be the conditional distribution $p_{R_2|R_1}(r_j \mid r_i)$. However, we have exploited the particular structure in our reward expression to come up with the independence condition in (25). This condition will enable us to prove a key result later which is otherwise not possible (see the remark following Lemma 4). Finally, all our subsequent results will hold for a more general model wherever the independence condition in (25) holds.

We will now proceed to formulate our partially observable 
model as a partially observable stochastic game (POSG). We 
will first formally describe the problem setting and then briefly 
discuss POSGs, before proceeding to our main results. 

A. Problem Formulation 

The actual state space of the system continues to be $\mathcal{X}$ (defined earlier). However, each forwarder now gets to observe only its part of the actual state (i.e., only its reward value) along with the relay’s location. Thus, when the $k$-th relay arrives, and if both forwarders are still competing, then the observations of $\mathscr{F}_1$ and $\mathscr{F}_2$ are of the form $(r_i, \ell)$ and $(\ell, r_j)$, respectively, where $(r_i, r_j)$ is the actual state and $L_k = \ell$ is the location of the $k$-th relay. Suppose $\mathscr{F}_2$ has already terminated before stage $k$; then$^{5}$ the location information is no longer required by $\mathscr{F}_1$, and hence we will denote its observation as $(r_i, t)$, which is simply the system state. Finally, when $\mathscr{F}_1$ terminates we use $t$ to denote its subsequent observations. Thus, we can write the observation space of $\mathscr{F}_1$ as

$$\mathcal{O}_1 = \left\{ (r_i, \ell), (r_i, t), t : i \in [n], \ell \in [m] \right\}. \tag{26}$$

Similarly, the observation space of $\mathscr{F}_2$ is given by

$$\mathcal{O}_2 = \left\{ (\ell, r_j), (t, r_j), t : j \in [n], \ell \in [m] \right\}. \tag{27}$$

Definition 3: We will modify$^{6}$ the definition of a policy pair, $(\bar\pi_1, \bar\pi_2)$ (see Definition 1), such that $\bar\pi_1 : \mathcal{O}_1 \to \{s, c\}$ and $\bar\pi_2 : \mathcal{O}_2 \to \{s, c\}$. Thus, the decision to stop or continue by $\mathscr{F}_1$ and $\mathscr{F}_2$, when the $k$-th relay arrives, is based on their respective observations $o_{1,k} \in \mathcal{O}_1$ and $o_{2,k} \in \mathcal{O}_2$.

Remark: Note that we have restricted the PO policies to be deterministic (and as before stationary), i.e., $\bar\pi_1(o_1)$ is either S or C without mixing between the two. Let $\Pi_D$ denote the set of all such deterministic policies. Restricting to $\Pi_D$ is primarily to simplify the analysis. However, it is not immediately clear if a partially observable NEPP (to be formally defined very soon) should even exist within the class $\Pi_D$. Our main result is to construct a Bayesian stage game and prove that this game contains pure strategy (or deterministic) NE vectors using which PO-NEPPs in $\Pi_D$ can be constructed.

$^{5}$ As mentioned earlier, $\mathscr{F}_1$ will come to know about $\mathscr{F}_2$’s termination by listening to the exchange of control packets between $\mathscr{F}_2$ and the chosen relay just before termination.

$^{6}$ In this section we will apply an overline to most of the symbols in order to distinguish them from the corresponding symbols that have already appeared in Section V.

Let $\{(O_{1,k}, O_{2,k}) : k \ge 1\}$ denote the sequence of joint-observations, and let $\{X_k : k \ge 1\}$ as before denote the sequence of states. Then the expected cost incurred by $\mathscr{F}_\rho$, $\rho = 1, 2$, when the PO policy pair used is $(\bar\pi_1, \bar\pi_2)$, and when its initial observation is $o_\rho$, can be written as

$$G^{(\rho)}_{\bar\pi_1,\bar\pi_2}(o_\rho) = \sum_{k=1}^{\infty} E^{o_\rho}_{\bar\pi_1,\bar\pi_2}\left[ g_\rho\big(X_k, (A_{1,k}, A_{2,k})\big) \right], \tag{28}$$

where $A_{1,k} = \bar\pi_1(O_{1,k})$ and $A_{2,k} = \bar\pi_2(O_{2,k})$.

Similar to the completely observable case, the objective for the partially observable (PO) case is to characterize PO-NEPPs, which are defined as follows:

Definition 4: We say that a PO policy pair $(\bar\pi_1^*, \bar\pi_2^*)$ is a PO-NEPP if $G^{(1)}_{\bar\pi_1^*,\bar\pi_2^*}(o_1) \le G^{(1)}_{\bar\pi_1,\bar\pi_2^*}(o_1)$ for all $o_1 \in \mathcal{O}_1$ and PO policy $\bar\pi_1 \in \Pi_D$, and $G^{(2)}_{\bar\pi_1^*,\bar\pi_2^*}(o_2) \le G^{(2)}_{\bar\pi_1^*,\bar\pi_2}(o_2)$ for all $o_2 \in \mathcal{O}_2$ and $\bar\pi_2 \in \Pi_D$.


We will end this section with the expressions for the various cost terms corresponding to a PO-NEPP, which are analogues of the cost terms in Section V-B.

Various Cost Terms: Recall the expression for $D^{(1)}$ from (11). Given an NEPP $(\pi_1^*, \pi_2^*)$, $D^{(1)}$ is the cost incurred by $\mathscr{F}_1$ if it continues alone. Similar expressions can be written for a PO-NEPP $(\bar\pi_1^*, \bar\pi_2^*)$:

$$\bar{D}^{(1)} = \tau + \sum_{i'} p^{(1)}_{i'}\, G^{(1)}_{\bar\pi_1^*,\bar\pi_2^*}(r_{i'}, t). \tag{29}$$

Similarly, for $\mathscr{F}_2$, the cost of continuing alone is

$$\bar{D}^{(2)} = \tau + \sum_{j'} p^{(2)}_{j'}\, G^{(2)}_{\bar\pi_1^*,\bar\pi_2^*}(t, r_{j'}). \tag{30}$$


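The following lemma will be useful.

Lemma 3: Let $(\pi_1^*, \pi_2^*)$ be an NEPP and $(\bar\pi_1^*, \bar\pi_2^*)$ be a PO-NEPP; then $J^{(1)}_{\pi_1^*,\pi_2^*}(r_i, t) = G^{(1)}_{\bar\pi_1^*,\bar\pi_2^*}(r_i, t)$ and $J^{(2)}_{\pi_1^*,\pi_2^*}(t, r_j) = G^{(2)}_{\bar\pi_1^*,\bar\pi_2^*}(t, r_j)$.

Proof: Whenever $\mathscr{F}_1$ is alone in the system, all its observations (which are of the form $(r_i, t)$ until $\mathscr{F}_1$ terminates) are exactly the actual states traversed by the system. Hence the problem of obtaining $G^{(1)}_{\bar\pi_1^*,\bar\pi_2^*}(r_i, t)$ is identical to the MDP problem of obtaining $J^{(1)}_{\pi_1^*,\pi_2^*}(r_i, t)$ in Section V-B, so that $G^{(1)}_{\bar\pi_1^*,\bar\pi_2^*}(\cdot, t)$ satisfies the Bellman equation in (10). Since the solution to (10) is unique [39] we obtain $J^{(1)}_{\pi_1^*,\pi_2^*}(r_i, t) = G^{(1)}_{\bar\pi_1^*,\bar\pi_2^*}(r_i, t)$. Similarly it follows that $J^{(2)}_{\pi_1^*,\pi_2^*}(t, r_j) = G^{(2)}_{\bar\pi_1^*,\bar\pi_2^*}(t, r_j)$. ■

Discussion: An immediate consequence of the above lemma is that $\bar{D}^{(1)} = D^{(1)}$ and $\bar{D}^{(2)} = D^{(2)}$. Further, if $(\bar\pi_1^*, \bar\pi_2^*)$ is a PO-NEPP then for states of the form $(r_i, t)$, $\bar\pi_1^*(r_i, t)$ is the same as $\pi_1^*(r_i, t)$ in (12). Similarly, for states of the form $(t, r_j)$, $\bar\pi_2^*(t, r_j)$ is the same as that in (15).

However, the analogues of the cost terms $C^{(1)}_{\pi_1^*,\pi_2^*}$ and $C^{(2)}_{\pi_1^*,\pi_2^*}$ (recall (18) and (19)) are different for the partially observable case. The expressions for these are

$$\bar{C}^{(1)}_{\bar\pi_1^*,\bar\pi_2^*} = \tau + \sum_{\ell',i'} q_{\ell'}\, p^{(1)}_{i'|\ell'}\, G^{(1)}_{\bar\pi_1^*,\bar\pi_2^*}(r_{i'}, \ell') \tag{31}$$

$$\bar{C}^{(2)}_{\bar\pi_1^*,\bar\pi_2^*} = \tau + \sum_{\ell',j'} q_{\ell'}\, p^{(2)}_{j'|\ell'}\, G^{(2)}_{\bar\pi_1^*,\bar\pi_2^*}(\ell', r_{j'}). \tag{32}$$

Finally, similar to the result in Lemma 2 we can show that for a PO-NEPP $(\bar\pi_1^*, \bar\pi_2^*)$,

$$\bar{D}^{(1)} \le \bar{C}^{(1)}_{\bar\pi_1^*,\bar\pi_2^*} \quad \text{and} \quad \bar{D}^{(2)} \le \bar{C}^{(2)}_{\bar\pi_1^*,\bar\pi_2^*}. \tag{33}$$

The proof of these is along exactly the same lines as the proof of Lemma 2. We do not repeat it for brevity.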


B. Partially Observable Stochastic Game (POSG)

A POSG is a tuple $(\mathcal{N}, \mathcal{X}, \mathcal{O}, \{\mathcal{A}_\rho\}, T, \{g_\rho\})$, where $\mathcal{N}$, $\mathcal{X}$, $\mathcal{A}_\rho$, and $g_\rho$ are as before (see Section V-A), while

• $\mathcal{O} = \times_{\rho \in \mathcal{N}} \mathcal{O}_\rho$ is the joint-observation space, with $\mathcal{O}_\rho$ being the observation space of player $\rho$, and

• $T : \mathcal{X} \times \mathcal{O} \times \mathcal{A} \to \Delta(\mathcal{X} \times \mathcal{O})$ is the transition function, where $T(x', o' \mid x, o, a)$ is the probability that the next state and the joint-observation is $(x', o')$ conditioned on the event that the current state, joint-observation and joint-action is $(x, o, a)$.

In the previous section we have seen that the NEPPs for a stochastic game can be obtained by constructing a normal-form static stage game. Similarly for POSGs, there is work (for instance, see [40]) that constructs a game which is effectively played at each stage; however, with the players not knowing the exact state of the system, the stage game now happens to be a Bayesian game [?, Chapter 9]. Hence, the drawback with POSGs in general is that, at each stage $k$, each player needs to maintain a belief (distribution) about the entire history of joint-observations and joint-actions,

$$\big( (o_{1,1}, o_{2,1}), (a_{1,1}, a_{2,1}), \cdots, (o_{1,k-1}, o_{2,k-1}), (a_{1,k-1}, a_{2,k-1}), (o_{1,k}, o_{2,k}) \big)$$

(referred to as the joint-type of the Bayesian game), obtaining which for a general POSG is computationally intensive.

For this reason the authors in [42] have studied a restriction of POSGs referred to as Markov games of incomplete information (MGII). In MGIIs the transition function $T$ satisfies the following Markov property: player 1’s belief about player 2’s current observation, $o_2'$, is independent of player 2’s previous observation, $o_2$, given the current state, $x'$, previous state, $x$, and player 1’s current and previous observations, $o_1'$ and $o_1$, respectively, i.e., for two different observations $u, v \in \mathcal{O}_2$ of player 2, $T(o_2' \mid x', x, o_1', o_1, o_2 = u) = T(o_2' \mid x', x, o_1', o_1, o_2 = v)$. A similar Markov structure should hold for the other players also. For our case it is easy to check that the above condition is trivially satisfied, primarily because all the associated random variables, $\{L_k\}$ and $\{(R_{1,k}, R_{2,k})\}$, are i.i.d. across the stage index $k$.

A major advantage with MGIIs is that the current joint-observation constitutes the type of the Bayesian game to be played at that stage. With this in mind, we will set up a Bayesian stage game in the next section, with $(r_i, \ell)$ and $(\ell, r_j)$ constituting the type of the game at stage $k$, provided both forwarders are still competing$^{7}$ at stage $k$.


C. Bayesian Stage Game 

We are now ready to provide a solution to the partially observable case in terms of a certain Bayesian game [?, Chapter 9] which is effectively played at any stage whenever both forwarders are contending. For the completely observable case, given a policy pair $(\pi_1, \pi_2)$, corresponding to each $(r_i, r_j)$ pair we constructed the normal-form game in Table 4. However here, given a PO policy pair $(\bar\pi_1, \bar\pi_2)$ and given the observation $(r_i, \ell)$, $\mathscr{F}_1$’s belief that the game in Table 4 (with $(C^{(1)}_{\pi_1,\pi_2}, C^{(2)}_{\pi_1,\pi_2})$ replaced by $(\bar{C}^{(1)}_{\bar\pi_1,\bar\pi_2}, \bar{C}^{(2)}_{\bar\pi_1,\bar\pi_2})$) will be played is $p^{(2)}_{j|\ell}$, $j \in [n]$. Hence, $\mathscr{F}_1$ needs to first compute the costs incurred for playing S and C, averaged over all observations $(\ell, r_j)$, $j \in [n]$, of $\mathscr{F}_2$. We will formally develop these in the following.

Strategy vectors and corresponding costs: Fixing the PO policy pair to be $(\bar\pi_1, \bar\pi_2)$ (unless otherwise stated), we will refer to the subsequent development (which includes the strategy vectors, various costs, best responses and NE vectors, to be discussed next) as the Bayesian game corresponding to $(\bar\pi_1, \bar\pi_2)$, denoted $\mathcal{G}(\bar\pi_1, \bar\pi_2)$.

Definition 5: For $\ell \in \mathcal{L}$ (recall that $\mathcal{L}$ is the set of possible relay locations), we define a strategy vector, $f_\ell$, of $\mathscr{F}_1$ as $f_\ell : \{r_i : i \in [n]\} \to \{s, c\}$. Similarly, a strategy vector $g_\ell$ of $\mathscr{F}_2$ is $g_\ell : \{r_j : j \in [n]\} \to \{s, c\}$. Thus, given the observation $(r_i, \ell)$ of $\mathscr{F}_1$, $f_\ell$ decides for $\mathscr{F}_1$ whether to stop or continue.

Now, given the strategy vector $g_\ell$ of $\mathscr{F}_2$, and the location information $\ell$, $\mathscr{F}_1$’s belief that $\mathscr{F}_2$ will choose action C is

$$\tilde{g}_\ell = \sum_{j : g_\ell(r_j) = C} p^{(2)}_{j|\ell}. \tag{34}$$

$(1 - \tilde{g}_\ell)$ is the probability that $\mathscr{F}_2$ will stop. Thus, the expected cost incurred by $\mathscr{F}_1$ for playing S when its observation is $(r_i, \ell)$ and when $\mathscr{F}_2$ uses $g_\ell$ is

$$C^{(1)}_{S,g_\ell}(r_i) = \tilde{g}_\ell(-\eta_1 r_i) + (1 - \tilde{g}_\ell)\, E^{(1)}(r_i), \tag{35}$$

where, recall from (16), $E^{(1)}(r_i) = \nu_1(-\eta_1 r_i) + \nu_2 D^{(1)}$. The various terms in (35) can be understood as follows: $\tilde{g}_\ell$ is the probability that $\mathscr{F}_2$ will continue, in which case $\mathscr{F}_1$ (having chosen the action S) stops, incurring a terminating cost of $-\eta_1 r_i$, while $(1 - \tilde{g}_\ell)$ is the probability that $\mathscr{F}_2$ will stop, in which case the expected cost is $\nu_1(-\eta_1 r_i) + \nu_2 D^{(1)}$: $\nu_1$ is the probability that $\mathscr{F}_1$ gets the relay and terminates, incurring a cost of $(-\eta_1 r_i)$; otherwise, w.p. $\nu_2$, $\mathscr{F}_2$ gets the relay, in which case $\mathscr{F}_1$ continues alone, the expected cost of which is $D^{(1)}$ (from Lemma 3).

The expected cost of continuing when $\mathscr{F}_1$’s observation is $(r_i, \ell)$ is

$$C^{(1)}_{C,g_\ell}(r_i) = \tilde{g}_\ell\, \bar{C}^{(1)}_{\bar\pi_1,\bar\pi_2} + (1 - \tilde{g}_\ell)\, D^{(1)}. \tag{36}$$

From the above expression we see that the cost of continuing is a constant in the sense that it does not depend on the value of $r_i$. Hence we will denote it as simply $C^{(1)}_{C,g_\ell}$. Further, note that $C^{(1)}_{C,g_\ell}$ depends on the PO policy pair $(\bar\pi_1, \bar\pi_2)$, but for simplicity we have not shown this dependence in the notation for $C^{(1)}_{C,g_\ell}$.

Similarly for $\mathscr{F}_2$, when its observation is $(\ell, r_j)$ and when $\mathscr{F}_1$ uses $f_\ell$, we have

$$C^{(2)}_{S,f_\ell}(r_j) = \tilde{f}_\ell(-\eta_2 r_j) + (1 - \tilde{f}_\ell)\, E^{(2)}(r_j),$$

$$C^{(2)}_{C,f_\ell} = \tilde{f}_\ell\, \bar{C}^{(2)}_{\bar\pi_1,\bar\pi_2} + (1 - \tilde{f}_\ell)\, D^{(2)},$$

where $\tilde{f}_\ell = \sum_{i : f_\ell(r_i) = C} p^{(1)}_{i|\ell}$.

$^{7}$ When only one forwarder is present we already know that the solution can be obtained by solving an MDP problem as in Section V-B (see Lemma 3).

Definition 6: We say that $f_\ell$ is the best response vector of $\mathscr{F}_1$ against the strategy vector $g_\ell$ played by $\mathscr{F}_2$, denoted $f_\ell = BR_1(g_\ell)$, if $f_\ell(r_i) = S$ iff $C^{(1)}_{S,g_\ell}(r_i) \le C^{(1)}_{C,g_\ell}$. Note that such an $f_\ell$ is unique. Similarly, $g_\ell$ is the (unique) best response against $f_\ell$ if $g_\ell(r_j) = S$ iff $C^{(2)}_{S,f_\ell}(r_j) \le C^{(2)}_{C,f_\ell}$. We denote this as $g_\ell = BR_2(f_\ell)$.
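The best response in Definition 6 can be computed directly from (34)-(36). The Python sketch below does so for the first forwarder; the function names `continue_prob`, `best_response_1` and `threshold`, and all numerical values (rewards, conditional p.m.f., $\bar{C}^{(1)}$, $D^{(1)}$, $\eta_1$, $\nu$), are hypothetical and serve only as an illustration. Rewards are assumed to be listed in increasing order.

```python
def continue_prob(strategy, cond_pmf):
    """Belief (34): probability that the opponent plays 'c' at this location."""
    return sum(p for a, p in zip(strategy, cond_pmf) if a == "c")


def best_response_1(g, p2_given_l, rewards, Cbar1, D1, eta1, nu):
    """Best response vector of forwarder 1 against g (Definition 6)."""
    g_tilde = continue_prob(g, p2_given_l)
    cost_continue = g_tilde * Cbar1 + (1 - g_tilde) * D1            # (36)
    f = []
    for r in rewards:
        E1 = nu[0] * (-eta1 * r) + nu[1] * D1                       # (16)
        cost_stop = g_tilde * (-eta1 * r) + (1 - g_tilde) * E1      # (35)
        f.append("s" if cost_stop <= cost_continue else "c")
    return f


def threshold(strategy):
    """Number of 'c' entries; for a vector that continues on the lowest
    rewards and stops on the rest this is the index after which it stops."""
    return sum(1 for a in strategy if a == "c")


# Hypothetical values for one location.
rewards = [1.0, 1.6, 2.0]
p2_given_l = [0.2, 0.3, 0.5]
Cbar1, D1, eta1, nu = -1.5, -1.75, 1.0, (0.5, 0.5)

f_vs_stop = best_response_1(["s", "s", "s"], p2_given_l, rewards, Cbar1, D1, eta1, nu)
f_vs_cont = best_response_1(["c", "c", "c"], p2_given_l, rewards, Cbar1, D1, eta1, nu)
```

In this example, when the opponent always stops the best response stops only at the largest reward, whereas when the opponent always continues it also stops at the middle reward; the forwarder becomes more willing to stop as the opponent becomes more likely to continue.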

Definition 7: For $\ell \in \mathcal{L}$, a pair of strategy vectors $(f_\ell^*, g_\ell^*)$ is said to be a Nash equilibrium (NE) vector for the game $\mathcal{G}(\bar\pi_1, \bar\pi_2)$ iff $f_\ell^* = BR_1(g_\ell^*)$ and $g_\ell^* = BR_2(f_\ell^*)$.

As remarked earlier, it is not immediately clear whether a NE vector should even exist among the pure strategies for the game $\mathcal{G}(\bar\pi_1, \bar\pi_2)$. Our main result in the next section (Theorem 4) is to provide a positive answer to this. In fact, we will not only prove the existence of NE vectors but also provide a method to construct them.

We will end this section with the following theorem, which is similar to Theorem 1(b), which was used to obtain NEPPs. This theorem will enable us to construct PO-NEPPs.

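Theorem 3: Given a PO policy pair $(\bar\pi_1^*, \bar\pi_2^*)$, construct the strategy vector pair $\{(f_\ell^*, g_\ell^*) : \ell \in \mathcal{L}\}$ as follows: $f_\ell^*(r_i) = \bar\pi_1^*(r_i, \ell)$ and $g_\ell^*(r_j) = \bar\pi_2^*(\ell, r_j)$ for all $i, j \in [n]$. Now, suppose for each $\ell$, $(f_\ell^*, g_\ell^*)$ is a NE vector for the game $\mathcal{G}(\bar\pi_1^*, \bar\pi_2^*)$ such that

$$\min\left\{ C^{(1)}_{S,g_\ell^*}(r_i),\; C^{(1)}_{C,g_\ell^*} \right\} = G^{(1)}_{\bar\pi_1^*,\bar\pi_2^*}(r_i, \ell), \text{ and} \tag{37}$$

$$\min\left\{ C^{(2)}_{S,f_\ell^*}(r_j),\; C^{(2)}_{C,f_\ell^*} \right\} = G^{(2)}_{\bar\pi_1^*,\bar\pi_2^*}(\ell, r_j). \tag{38}$$

Then $(\bar\pi_1^*, \bar\pi_2^*)$ is a PO-NEPP.

Proof: See Appendix E. ■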

Discussion: If $(f_\ell^*, g_\ell^*)$ happens to be a NE vector, then from Definition 7 it simply follows that the LHS of (37) (resp. (38)) is simply the cost incurred by $\mathscr{F}_1$ (resp. $\mathscr{F}_2$) for playing the action $f_\ell^*(r_i)$ (resp. $g_\ell^*(r_j)$) suggested by its NE vector. Thus, (37) and (38) collectively say that the cost pair obtained by playing the NE vector $(f_\ell^*, g_\ell^*)$ in the Bayesian game $\mathcal{G}(\bar\pi_1^*, \bar\pi_2^*)$ is equal to the cost pair incurred by the PO policy pair $(\bar\pi_1^*, \bar\pi_2^*)$ in the original POSG. Hence, this result could be thought of as the analogue of Theorem 1(b) proved for the completely observable case.

Existence of a NE Vector: We will fix a PO policy pair $(\bar\pi_1, \bar\pi_2)$ that satisfies the inequalities in (33). In this section we will prove that there exists a NE vector for $\mathcal{G}(\bar\pi_1, \bar\pi_2)$. Before proceeding to the main theorem we need the following results (Lemmas 4 and 5).


Lemma 4: For any $\ell \in \mathcal{L}$, the best response vector, $f_\ell$, against any vector $g_\ell$ of $\mathscr{F}_2$ is a threshold vector, i.e., there exists a $\Phi_\ell \in \{0, 1, \cdots, n\}$ such that $f_\ell(r_i) = S$ iff $i > \Phi_\ell$. We refer to $\Phi_\ell$ as the threshold of $f_\ell$. Similarly, if $g_\ell$ is the best response against any vector $f_\ell$ of $\mathscr{F}_1$, then $g_\ell$ is a threshold vector with threshold $\Psi_\ell$.

Proof: Since $r_{i'} \le r_i$ whenever $i' \le i$, we can write $C^{(1)}_{S,g_\ell}(r_{i'}) \ge C^{(1)}_{S,g_\ell}(r_i)$ (see (35)). Then the proof follows by recalling Definition 6. ■

Remark: The above lemma is possible primarily because of the independence assumption we had imposed at the beginning of Section VI. Suppose we had worked with the model where, given only $r_i$, $\mathscr{F}_1$’s belief about $\mathscr{F}_2$’s observation is simply the conditional p.m.f. $p_{R_2|R_1}(r_j \mid r_i)$, $j \in [n]$; then, as in (34), we can write the expression for the continuing probability as

$$\tilde{g}_{\ell,r_i} = \sum_{j : g_\ell(r_j) = C} p_{R_2|R_1}(r_j \mid r_i), \tag{39}$$

which is now a function of $r_i$. If we replace $\tilde{g}_\ell$ in (35) by $\tilde{g}_{\ell,r_i}$, it is not possible to conclude $C^{(1)}_{S,g_\ell}(r_{i'}) \ge C^{(1)}_{S,g_\ell}(r_i)$ whenever $i' \le i$, as required for the proof of the above lemma.

The following is an immediate consequence of Lemma 4: if $(f_\ell^*, g_\ell^*)$ is a NE vector then $f_\ell^*$ and $g_\ell^*$ are both threshold vectors. Thus, we can restrict our search for NE vectors to the class of all pairs of threshold vectors. Since a threshold vector $f_\ell$ can be equivalently represented by its threshold $\Phi_\ell$, we can alternatively work with the thresholds. Thus $\Phi_\ell \in A_0 := \{0, 1, \cdots, n\}$ represents the $n + 1$ thresholds that $\mathscr{F}_1$ can use. $\Phi_\ell = 0$ (respectively, $\Phi_\ell = n$) represents the threshold vector which, when used by $\mathscr{F}_1$, stops (respectively, continues) for any value of $r_i$ when the location is $\ell$. Similarly, we will represent the $n + 1$ thresholds that $\mathscr{F}_2$ can use by $\Psi_\ell \in A_0$. We will write $\Phi_\ell = BR_1(\Psi_\ell)$ whenever their corresponding threshold vectors, $f_\ell$ and $g_\ell$, respectively, are such that $f_\ell = BR_1(g_\ell)$. Similarly, we will write $\Psi_\ell = BR_2(\Phi_\ell)$ whenever $g_\ell = BR_2(f_\ell)$.

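Lemma 5: (1) Let $\Psi_\ell, \Psi_\ell^{\circ} \in A_0$ be two thresholds of $\mathscr{F}_2$ such that $\Psi_\ell < \Psi_\ell^{\circ}$; then the best responses of $\mathscr{F}_1$ to these are ordered as $BR_1(\Psi_\ell) \ge BR_1(\Psi_\ell^{\circ})$. (2) Similarly, if $\Phi_\ell, \Phi_\ell^{\circ} \in A_0$ are two thresholds of $\mathscr{F}_1$ such that $\Phi_\ell < \Phi_\ell^{\circ}$, then $BR_2(\Phi_\ell) \ge BR_2(\Phi_\ell^{\circ})$.

Proof: See Appendix F. ■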

We are now ready to prove the following main theorem. We will present the complete proof here because the proof technique will be required in the next section to construct PO-NEPPs.

Theorem 4: For every $\ell \in \mathcal{L}$, there exists a NE vector $(f_\ell^*, g_\ell^*)$ for the game $\mathcal{G}(\bar\pi_1, \bar\pi_2)$.

Proof: As mentioned earlier, a consequence of Lemma 4
is that it is sufficient to restrict our search for NE vectors
to the class of all pairs of threshold vectors. Let $A_0 :=
\{\Phi_\ell : 0 \le \Phi_\ell \le n\}$ denote the set of all $n + 1$ thresholds
of $\mathscr{F}_1$. Now, for $1 \le k \le n$, inductively define the sets $B_k$
and $A_k$ as follows:

$$B_k = \{BR_2(\Phi_\ell) : \Phi_\ell \in A_{k-1}\} \quad \text{and} \quad A_k = \{BR_1(\Psi_\ell) : \Psi_\ell \in B_k\}.$$

It is easy to check that through this inductive process we
will finally end up with non-empty sets $A_n$ and $B_n$ such that

• for each $\Phi_\ell \in A_n$ there exists a unique $\Psi_\ell \in B_n$ such
that $\Phi_\ell = BR_1(\Psi_\ell)$, and

• for each $\Psi_\ell \in B_n$ there exists a unique $\Phi_\ell \in A_n$ such
that $\Psi_\ell = BR_2(\Phi_\ell)$.

Since best responses are unique, these would also mean that
$|A_n| = |B_n|$.

Note that there is nothing special about this inductive
process, in the sense that for any normal form game with
two players, each of whose action set is $A_0$, this inductive
process will still yield sets $A_n$ and $B_n$ satisfying the above
properties whenever the best responses are unique. However,
it is possible that there exists no pair $(\Phi_\ell, \Psi_\ell) \in A_n \times B_n$
such that $\Phi_\ell = BR_1(\Psi_\ell)$ and $\Psi_\ell = BR_2(\Phi_\ell)$. For instance,
$A_n = \{\Phi_\ell, \Phi'_\ell\}$, $B_n = \{\Psi_\ell, \Psi'_\ell\}$, and $BR_2(\Phi_\ell) = \Psi_\ell$ and
$BR_2(\Phi'_\ell) = \Psi'_\ell$ while $BR_1(\Psi_\ell) = \Phi'_\ell$ and $BR_1(\Psi'_\ell) = \Phi_\ell$.
This is precisely where Lemma 5 will be useful, due to which
such a situation cannot arise in our case.

Now, arrange the $N = |A_n|$ $(= |B_n|)$ remaining thresholds
in $A_n$ and $B_n$ as $\Phi_{\ell,1} < \Phi_{\ell,2} < \cdots < \Phi_{\ell,N}$
and $\Psi_{\ell,1} < \Psi_{\ell,2} < \cdots < \Psi_{\ell,N}$, respectively. Then
$\Phi_{\ell,1} = BR_1(\Psi_{\ell,N})$, since if not then using Lemma 5 we
can write $\Phi_{\ell,1} < BR_1(\Psi_{\ell,N}) \le BR_1(\Psi_{\ell,t})$ for every
$t = 1, 2, \ldots, N$, contradicting the fact that $\Phi_{\ell,1}$, being in $A_n$,
has to be the best response for some $\Psi_{\ell,t} \in B_n$. Similarly
$\Psi_{\ell,N} = BR_2(\Phi_{\ell,1})$, otherwise again from Lemma 5 we obtain
$\Psi_{\ell,N} > BR_2(\Phi_{\ell,1}) \ge BR_2(\Phi_{\ell,t})$ for every $t = 1, 2, \ldots, N$,
leading to the contradiction that $\Psi_{\ell,N}$ is not the best response
to any $\Phi_{\ell,t} \in A_n$. Thus the threshold strategy pair $(f_\ell, g_\ell)$
corresponding to the threshold pair $(\Phi_{\ell,1}, \Psi_{\ell,N})$ is a NE
vector. By an inductive argument, it can be shown that all
the threshold vector pairs corresponding to the threshold pairs
$(\Phi_{\ell,t}, \Psi_{\ell,N-(t-1)})$, $t = 1, 2, \ldots, N$, are NE vectors. ■
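
The inductive construction used in the proof of Theorem 4 can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function name `ne_threshold_pairs` and the list-based encoding of the best-response maps (`br1[psi]` is the best response of forwarder 1 to threshold `psi`, `br2[phi]` the best response of forwarder 2 to `phi`) are assumptions made here. Instead of running exactly $n$ rounds, the sketch iterates until the set of surviving thresholds stops shrinking, and it re-checks every returned pair for mutual best response.

```python
def ne_threshold_pairs(br1, br2):
    """Return threshold pairs (phi, psi) that are mutual best responses.

    br1[psi] : best-response threshold of forwarder 1 to threshold psi
    br2[phi] : best-response threshold of forwarder 2 to threshold phi
    Both lists have length n + 1 (thresholds 0..n). The pairing step
    relies on both maps being non-increasing, as in Lemma 5.
    """
    n = len(br1) - 1
    surviving_1 = set(range(n + 1))          # plays the role of A_0
    while True:
        surviving_2 = {br2[phi] for phi in surviving_1}       # B_k
        next_surviving_1 = {br1[psi] for psi in surviving_2}  # A_k
        if next_surviving_1 == surviving_1:
            break
        surviving_1 = next_surviving_1

    # Pair the t-th smallest threshold of forwarder 1 with the
    # t-th largest threshold of forwarder 2, then verify each pair.
    phis = sorted(surviving_1)
    psis = sorted(surviving_2, reverse=True)
    return [(phi, psi) for phi, psi in zip(phis, psis)
            if br1[psi] == phi and br2[phi] == psi]


if __name__ == "__main__":
    # Hypothetical non-increasing best-response maps for n = 3.
    br1 = [3, 3, 1, 0]
    br2 = [2, 2, 1, 0]
    print(ne_threshold_pairs(br1, br2))
```

The first returned pair corresponds to forwarder 1 using its lowest surviving threshold while forwarder 2 uses its highest, and the last pair to the opposite extreme.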
D. PO-NEPP Construction from NE Vectors 


Once we have obtained NE vectors $(f_\ell^*, g_\ell^*)$ for each
$\ell \in [m]$, the procedure for constructing a PO-NEPP from NE
vectors is along the same lines as the construction of an NEPP
from NE strategies (see Section V-C).

We begin with a pair of cost terms, $C = (C^{(1)}, C^{(2)})$, satisfying
(33). Using the procedure in the proof of Theorem 4 we
obtain, for each $\ell \in \mathcal{L}$, the NE vector $(f_\ell, g_\ell)$ corresponding
to the threshold pair $(\Phi_{\ell,1}, \Psi_{\ell,N})$ ($\mathscr{F}_1$ using the lowest threshold
while $\mathscr{F}_2$ uses the highest). Then we define


G (1) {nJ) = min { C 2v(r-i)A^v} 

G (2) (£, rj ) = min | v (■ rj ), C^ }. 


Now recall the expressions for the costs C ' 1 and from 
© and ( [32| . Compute the RHS of these expressions by 
replacing (■) and (■) by the functions G^^(-) 

and G ( ' 2> (■), respectively. Denote the computed sums as 
Ti(C) and T 2 (C), respectively. Suppose C is such that 
C = (Ti(C),T 2 (C)) (we inductively iterate to obtain such 
a C) then using Theorem [ 3 ] we can construct the PO-NEPP, 
(7fy,7fJ) using [fj,gj) as follows: for each i,j £ [n] and 
£ £ G, iff (n,£) = fjin) and 7 f ^ 7 (£, r 3 ) = gj(r 3 ). 

Finally, since the threshold vector corresponding 

to the threshold pair Ai) (A us iug highest threshold 


while J ^2 uses the lowest) is also a NE vector, one can 
similarly construct the PO-NEPP, (fff ,7f^), using (f^,g^). 


VII. Cooperative Case 


It will be interesting to benchmark the best performance that 
can be achieved if both forwarders would cooperate with each 
other. In this section, we will describe this case and construct 
a Pareto optimal performance curve. 

We will assume the completely observable case. The definition
of a policy pair $(\pi_1, \pi_2)$ and the costs $J^{(1)}_{\pi_1,\pi_2}(x)$ and
$J^{(2)}_{\pi_1,\pi_2}(x)$ will remain as in Section V. However, here our
objective is instead to optimize a linear combination of the
two costs. Formally, let $\gamma \in (0,1)$; then the problem we are
interested in is

$$\text{Minimize}_{(\pi_1,\pi_2)} \left( \gamma J^{(1)}_{\pi_1,\pi_2}(x) + (1 - \gamma) J^{(2)}_{\pi_1,\pi_2}(x) \right). \qquad (40)$$

Let $(\pi_1^\gamma, \pi_2^\gamma)$ denote the policy pair which is optimal for the
above problem. Then, using (18) and (19), it is easy to show
that $(\pi_1^\gamma, \pi_2^\gamma)$ is also optimal for

$$\text{Minimize}_{(\pi_1,\pi_2)} \left( \gamma C^{(1)}_{\pi_1,\pi_2} + (1 - \gamma) C^{(2)}_{\pi_1,\pi_2} \right). \qquad (41)$$


We have the following lemma. 

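Lemma 6: The policy pair $(\pi_1^\gamma, \pi_2^\gamma)$ is Pareto optimal, i.e.,
for any other policy pair $(\pi_1, \pi_2)$,

(1) if $C^{(1)}_{\pi_1,\pi_2} < C^{(1)}_{\pi_1^\gamma,\pi_2^\gamma}$ then $C^{(2)}_{\pi_1,\pi_2} > C^{(2)}_{\pi_1^\gamma,\pi_2^\gamma}$, and

(2) if $C^{(2)}_{\pi_1,\pi_2} < C^{(2)}_{\pi_1^\gamma,\pi_2^\gamma}$ then $C^{(1)}_{\pi_1,\pi_2} > C^{(1)}_{\pi_1^\gamma,\pi_2^\gamma}$.

Proof: Available in Appendix G. ■

Thus, by varying $\gamma \in (0,1)$, we obtain a Pareto optimal
boundary whose points are $(C^{(1)}_{\pi_1^\gamma,\pi_2^\gamma}, C^{(2)}_{\pi_1^\gamma,\pi_2^\gamma})$. Details on how
to obtain $(\pi_1^\gamma, \pi_2^\gamma)$ are available in Appendix G.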
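
To illustrate how sweeping the weight $\gamma$ traces out Pareto optimal points, the following Python sketch minimizes the weighted sum $\gamma C^{(1)} + (1-\gamma) C^{(2)}$ over a finite, explicitly enumerated list of candidate cost pairs. This is a simplification for illustration only: the actual optimization in this section is over policy pairs (see Appendix G), and the function name and the candidate values below are hypothetical.

```python
def pareto_points_by_weighted_sum(cost_pairs, gammas):
    """For each weight gamma in (0, 1), pick the cost pair (c1, c2)
    minimizing gamma * c1 + (1 - gamma) * c2; return the distinct
    minimizers in the order they are found."""
    selected = []
    for gamma in gammas:
        best = min(cost_pairs,
                   key=lambda c: gamma * c[0] + (1 - gamma) * c[1])
        if best not in selected:
            selected.append(best)
    return selected


if __name__ == "__main__":
    # Hypothetical cost pairs; (4.0, 4.0) is dominated by (2.0, 2.0).
    candidates = [(1.0, 5.0), (2.0, 2.0), (5.0, 1.0), (4.0, 4.0)]
    print(pareto_points_by_weighted_sum(candidates, [0.1, 0.5, 0.9]))
```

A dominated pair is never selected for any $\gamma \in (0,1)$, which is the finite analogue of the Pareto optimality statement in Lemma 6.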


VIII. Numerical and Simulation Results for the 
Geographical Forwarding Example 

A. One-Hop Study

The one-hop study can be more general, requiring only a 
joint p.m.f. pij, a location p.m.f. qt, and conditional p.m.f.s 
4 ^1 and //'jj (for all i,j and £). However, to illustrate the 
practicality of our study, we will study the geographical 
forwarding example described in Section |Hl| 

Recall the packet forwarding scenario illustrated in Fig. 1.
We will fix the locations of $\mathscr{F}_1$ and $\mathscr{F}_2$ to be $v_1 = [0, \theta/2]$ and
$v_2 = [0, -\theta/2]$, respectively. Thus, the distance of separation
between the two forwarders is $\theta$ meters (m); we will vary $\theta$
and study the performance of the various policies. The range
of each forwarder is $d = 80$ m. The combined forwarding
region is discretized into a uniform grid where the distance
between neighboring points is 5 m. Finally, the sink node
is placed at $v_0 = [1000, 0]$.

Next, recall the power and reward expressions (the latter
from (6)). We have fixed $d_{\mathrm{ref}} = 5$ m, $\xi = 2.5$,
and $a = 0.5$. For $\Gamma N_0$, which is referred to as the receiver
sensitivity, we use a value of $10^{-9}$ milliwatts (mW)
(equivalently $-90$ dBm) specified for the Crossbow TelosB
wireless mote [43]. The maximum transmit power available
at a node is $P_{\max} = 1$ mW (equivalently 0 dBm; again from
the Crossbow TelosB data sheet). We allow for four different
channel gain values: $0.4 \times 10^{-3}$, $0.6 \times 10^{-3}$, $0.8 \times 10^{-3}$, and
$1 \times 10^{-3}$, each occurring with equal probability. Finally, we fix
$\eta_1 = \eta_2 = 100$ (recall that $\eta_\rho$ is the parameter used to trade
off between delay and reward (see (2))), $\nu_1 = \nu_2 = 0.5$
($\nu_\rho$ is the probability that $\mathscr{F}_\rho$ will win the contention), and the
mean inter-wake-up time $\tau = 10$ milliseconds (ms).

Fig. 3. Performance of the various NEPPs and PO-NEPPs depicted as points in $\mathbb{R}^2$ where the first (second) coordinate is the expected cost incurred
by $\mathscr{F}_1$ ($\mathscr{F}_2$). Fig. 3(a) corresponds to the case when the distance of separation $\theta = 0$ m. A portion of Fig. 3(a) is enlarged and shown in Fig. 3(b). Fig. 3(c)
corresponds to $\theta = 10$ m.

We first set $\theta = 0$ m (recall that $\theta$ is the distance
between the two forwarders) and, in Fig. 3(a), depict the
performance of various NEPPs and PO-NEPPs as pairs of costs
$C = (C^{(1)}, C^{(2)})$ where $C^{(\rho)}$ is the cost incurred by $\mathscr{F}_\rho$
starting from time 0 if the particular NEPP or PO-NEPP is
used. Also shown in Fig. 3(a) is the performance of a simple
policy (the point marked ×; to be described next) along with
the Pareto optimal boundary (the solid curve). Since, from
Fig. 3(a), it is not easy to distinguish between the various
points, we show a section of Fig. 3(a) as Fig. 3(b). Fig. 3(c)
corresponds to $\theta = 10$ m.

Various Policy Pairs: The description of the various points seen
in Fig. 3 is as follows (we will use $C_{\text{symbol}}$ to denote the cost
pair corresponding to the policy symbol):

• ★, ○, and □: performances of the NEPPs that use the NE
strategies (s,c), (c,s), and the mixed strategy $(\Gamma_1, \Gamma_2)$,
respectively, whenever $(r_i, r_j) \in \mathcal{R}_4(C_\star)$, $\mathcal{R}_4(C_\circ)$, and
$\mathcal{R}_4(C_\square)$, respectively (recall Fig. 2).

• ▽ and △: performances of the PO-NEPPs that are
constructed by choosing, for each $\ell \in \mathcal{L}$, the thresholds
$(\Phi_{\ell,1}, \Psi_{\ell,N})$ and $(\Phi_{\ell,N}, \Psi_{\ell,1})$, respectively (recall the
proof of Theorem 4).

• ×: performance of a simple policy where each forwarder
$\mathscr{F}_\rho$ ($\rho = 1, 2$) chooses S if and only if its reward value
is at least $\alpha^{(\rho)}$. Such a policy is optimal whenever $\mathscr{F}_\rho$ is
alone in the system (recall (12) and (13)). Thus, using the
simple policy each forwarder behaves as if the competing
forwarder is not present.

• solid curve: Pareto optimal boundary obtained by
$(\pi_1^\gamma, \pi_2^\gamma)$, $\gamma \in (0,1)$; recall Section VII.


Observations: From Fig. 3(b) we see that operating at NEPP
★ is most favorable for $\mathscr{F}_2$ since $C^{(2)}_\star$ is less than the cost to
$\mathscr{F}_2$ at the other two NEPPs, $C^{(2)}_\circ$ and $C^{(2)}_\square$. This is because
whenever $(r_i, r_j) \in \mathcal{R}_4(C_\star)$ the joint action (s,c) played by
★ fetches $\mathscr{F}_2$ the least cost (of $D^{(2)}$) possible by any strategy. In
contrast, $\mathscr{F}_1$ incurs the highest cost (of $-\eta_1 r_i$) possible, because
of which NEPP ★ is least favorable for $\mathscr{F}_1$. For a similar
reason, operating at NEPP ○ is most favorable for $\mathscr{F}_1$ while
being least favorable for $\mathscr{F}_2$. The NEPP □, which chooses
the mixed strategy $(\Gamma_1, \Gamma_2)$ whenever $(r_i, r_j) \in \mathcal{R}_4(C_\square)$,
helps to achieve a fairer cost to both players; however, the
performance at □ is slightly farther from the Pareto boundary
when compared with the other two NEPPs.

The performance at the PO-NEPPs, ▽ and △, is worse
than at the NEPPs, thus exhibiting the loss in performance
due to partial information. The PO-NEPP ▽, which uses the
NE vector corresponding to the lowest-highest best response
pair, $(\Phi_{\ell,1}, \Psi_{\ell,N})$ (for each $\ell \in \mathcal{L}$), provides lower cost to
$\mathscr{F}_2$ than the PO-NEPP △. This is because $\mathscr{F}_1$, using a lower
threshold, will essentially choose an initial relay, thus leaving
$\mathscr{F}_2$ alone in the system, which can now accrue a better cost. For
a similar reason, operating at △ leads to $\mathscr{F}_1$ achieving a lower
cost. Finally, the simple policy × has the worst performance
in comparison with all other points, suggesting that it may
not be wise to operate using this policy pair. However,
as we increase the value of $\theta$ the performance of the simple
policy improves, and interestingly for $\theta = 10$ m (which is
only 12.5% of the forwarders' range of 80 m) we observe
that the various points are practically indistinguishable (note
that the magnitude of the scales in Fig. 3(a) and Fig. 3(c) is
the same). We have observed a similar trend when $\eta_1 = \eta_2$ and
$a$ are set to different values.


Key Insight: Thus, based on our numerical work we draw
the following key insight: even for a small distance of separation
between the forwarders, using the simple policy pair
(where each forwarder behaves as if it is alone in the system)
yields little (or, practically, no) loss in performance when
compared with the performance of an NEPP or a PO-NEPP;
however, the performance degradation of the simple policy is
significant whenever the forwarders are very close to each
other. These observations are for the case where there are
two forwarders. However, we expect a similar behavior for
the simple policy even if there are more than two forwarders,
i.e., we believe that the simple policy performs well if the
competing forwarders are moderately separated.
B. End-to-End Study

Finally, in this section we use simulation to provide an
evaluation of the end-to-end performance of local forwarding.
The competitive forwarding policies (i.e., NEPP and PO-NEPP)
are difficult to implement since their structure has to
be evaluated for each forwarding instance along the path of
a packet. However, based on our observations in the previous
section, we study the performance of the simple policy pair.
In our prior work we have already studied the simple policy's
performance (see Fig. 8 of that work, where the simple policy is referred
to as SF), but there the setting was that of the lone packet
model where a single alarm packet is generated which is then
routed to the sink. Here, we will generalize the lone packet
setting by generating multiple packets simultaneously across
the network so that a packet, along its route, might have to
compete with other packets in its vicinity before reaching the
sink.

We first form a network by randomly placing 1000 nodes
in a square region of area 1 km². A source node is placed at
[0, 1000] and a sink node at the diagonally opposite
corner [1000, 0]. Each node is allowed to asynchronously and
periodically sleep-wake cycle with period $T = 100$ ms, i.e.,
each node $i$ wakes up and stays ON for a small duration (which
we neglect, given the other time scales) at the periodic instants
$T_i + kT$, $k \ge 1$, where $\{T_i\}$ are i.i.d. uniform on $[0, T]$ (recall
the discussion on the sleep-wake process from Section ID- 

Each node $i$, assuming a mean inter-wake-up time of $T/N_i$
(where $N_i$ is the average number of nodes in the forwarding
region of node $i$), obtains $\alpha_i$, which is the threshold (on reward)
required to implement the simple policy by node $i$. The values
of all the other parameters required to compute the threshold,
e.g., $P_{\max}$, $\xi$, etc., remain the same as in our one-hop study.
If there is no relay whose reward value is more than $\alpha_i$ (node
$i$ will know of this after waiting for one entire duty-cycling
period $T$), node $i$, at time $T$, will simply forward the packet to
the relay with the maximum reward (thus, as relays wake up,
the best relay so far is asked to wait).
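
The forwarding rule described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the simulator used for the results: the function name, the representation of wake-ups as `(time, reward)` tuples, and the treatment of a reward exactly equal to the threshold as acceptable are all choices made here for concreteness.

```python
def simple_policy_forward(wakeups, alpha, period):
    """Apply the single-threshold rule at one node.

    wakeups : iterable of (wake_time, reward) for relays waking in [0, period]
    alpha   : reward threshold of the simple policy at this node
    period  : duty-cycling period T

    Returns (forwarding_time, reward_of_chosen_relay), or None if no
    relay woke up during the period.
    """
    best_reward = None
    for wake_time, reward in sorted(wakeups):
        if reward >= alpha:
            # First relay meeting the threshold is chosen immediately.
            return wake_time, reward
        if best_reward is None or reward > best_reward:
            # Otherwise remember the best relay so far (asked to wait).
            best_reward = reward
    if best_reward is None:
        return None
    # No relay met the threshold: forward to the best one at time T.
    return period, best_reward


if __name__ == "__main__":
    relays = [(30, 0.2), (10, 0.1), (60, 0.9)]   # hypothetical values
    print(simple_policy_forward(relays, alpha=0.5, period=100))
    print(simple_policy_forward(relays, alpha=2.0, period=100))
```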

The source node generates an alarm packet at time 0.
We introduce competition by generating additional packets
at randomly chosen nodes, randomly in time at the points
of a Poisson process of rate $\lambda$. All the packets are destined
for the same sink. While forwarding, if a relay is chosen
simultaneously by more than one forwarder, then one of them,
chosen at random, will win the contention and get the relay to
forward its packet to. We are interested in studying, as a
function of $\lambda$, the performance obtained (in terms of end-to-end
delay and the total power expended) in routing the source's
packet.

In Fig. 4 we have plotted, for different values of $\lambda$, the mean
end-to-end delay vs. the mean end-to-end power (averaged
over packets from the source located at (0, 1000)). These
curves are obtained by varying $\eta$, the parameter used to trade
off between delay and reward in the local problem. Each data
point in Fig. 4 is the average of the respective quantities over
100 alarm packets generated by the source node. Also shown
in the figure is the performance curve corresponding to the
"lone packet case" where no additional packets are generated.
Hence the lone packet curve is analogous to the SF policy's
performance curves in Fig. 8 of our prior work.

Fig. 4. End-to-end performance (average power vs. average delay) of the
simple policy as the competing packet rate $\lambda$ in the network is increased.

Observe that, as we increase $\lambda$, we obtain a degradation
in performance, i.e., increased delay and power compared
with the lone packet case. This is because, as $\lambda$ increases,
since there are more packets in the network, there is a larger
probability that a forwarding node carrying the source's packet
has to compete with other packets in the process of acquiring
a relay. Also, as $\lambda$ increases, at these instances of competition,
the competing nodes tend to be closer together. From the
observations in the previous section, we can conclude that
as $\lambda$ increases the performance of the simple policy will
progressively degrade. However, the performance degradation
is only marginal when the packet rate $\lambda$ < 20 packets/sec
while being moderate for $\lambda$ = 30 packets/sec, thus supporting
the usage of the simple policy for these packet rates. For
higher values of $\lambda$ (e.g., $\lambda$ = 40 packets/sec and beyond) the
performance degradation is significant and hence there could
be a benefit in using NEPPs to forward packets for these rates.

Finally, we have only presented simulation results for the
simple policy, since implementing NEPPs or PO-NEPPs for
end-to-end routing has the following difficulties: (1) for a given
pair of neighboring nodes, obtaining NEPPs will require fixed
point iterations; (2) NEPPs are node pair dependent, so that
NEPPs have to be computed for all possible neighboring node
pairs, since during actual forwarding a
node may be competing with any of its neighbors. Thus, there
is a large complexity involved in implementing NEPPs. In
contrast, the simple policy (being based on a single threshold) is
easy to implement. Moreover, for realistic parameter values
corresponding to the TelosB wireless mote, we have seen that the
performance of the simple policy is good (in comparison with the
lone packet case) for packet rates $\lambda$ < 30 packets/sec.

IX. Conclusion 

We studied the problem of competitive relay selection when
two forwarders compete for a next-hop relay (or some resource
in general). We first considered the model where complete
information is available to both the forwarders. We formulated
the problem as a stochastic game and proceeded to obtain
solutions in terms of Nash equilibrium policy pairs (NEPPs).
We were able to provide insight into the structure of NEPPs,
which was primarily possible because of the following key
result (Lemma 2): the "cost of continuing alone" is less than
the "cost of continuing along with a competing forwarder".
We next studied a partially observable case for which we
constructed a Bayesian game which is effectively played at each
stage. For this Bayesian game, we proved the existence of a
Nash equilibrium strategy within the class of (pure) threshold
vectors (Theorem 4). The proof method of this result enabled
us to construct NEPPs for the partially observable case. For the
geographical forwarding example, through numerical experiments we
observed that, even for moderate separation between the two
forwarders, the performance of our simple policy is as good as
the performance of any other NEPP/PO-NEPP. In the context
of end-to-end forwarding, through simulations we established
(for the considered setting) that for packet rates less than 30
packets/second, the performance of the simple policy is good
compared with the lone packet case.
Appendix A
Proof of Lemma 1

For convenience, here in the appendix we will recall the respective Lemma/Theorem statement before providing its proof.

Lemma 1: $\alpha^{(1)}$ is the unique fixed point of $\beta^{(1)}(x)$ ($x \in (-\infty, r_n]$) in (13).

Proof: Let us recall the expression of $\beta^{(1)}(x)$:

$$\beta^{(1)}(x) = E\left[\max\{x, R_1\}\right] - \frac{\tau}{\eta_1},$$

where the expectation is w.r.t. the p.m.f. $p^{(1)}$ of $R_1$ (recall that $R_1$ takes values from the set $\{r_1, \ldots, r_n\}$).

Let $m = \max\{i \in [n] : p_i^{(1)} > 0\}$. For $x > r_m$, note that $\beta^{(1)}(x) = x - \tau/\eta_1 < x$. Hence a fixed point, if any, should
lie within $(-\infty, r_m]$. Let us restrict $\beta^{(1)}(\cdot)$ to the domain $(-\infty, r_m]$. Then, since $\beta^{(1)}(x) < r_m$ for any $x \in (-\infty, r_m]$, we
have $\beta^{(1)} : (-\infty, r_m] \to (-\infty, r_m]$. We can now proceed to show that $\beta^{(1)}(x)$ restricted to $x \in (-\infty, r_m]$ is a contraction
mapping, i.e., for any $x, x' \in (-\infty, r_m]$, we need to show that

$$\| \beta^{(1)}(x) - \beta^{(1)}(x') \| \le \kappa \, \| x - x' \| \qquad (42)$$

for some $\kappa < 1$. Without loss of generality let $x > x'$. Then,

$$\begin{aligned}
\| \beta^{(1)}(x) - \beta^{(1)}(x') \| &= \beta^{(1)}(x) - \beta^{(1)}(x') \\
&= E\left[\max\{x, R_1\}\right] - E\left[\max\{x', R_1\}\right] \\
&= \sum_{i=1}^{n} p_i^{(1)} \left( \max\{x, r_i\} - \max\{x', r_i\} \right) \\
&\overset{*}{=} \sum_{i=1}^{m} p_i^{(1)} \left( \max\{x, r_i\} - \max\{x', r_i\} \right) \\
&\overset{\circ}{=} \sum_{i=1}^{m-1} p_i^{(1)} \left( \max\{x, r_i\} - \max\{x', r_i\} \right) \\
&\overset{\dagger}{\le} \sum_{i=1}^{m-1} p_i^{(1)} (x - x') \\
&= (1 - p_m^{(1)}) \, \| x - x' \|.
\end{aligned}$$

In the above derivation, $*$ is because $p_i^{(1)} = 0$ for $i > m$ (recall the definition of $m$); $\circ$ is because, since $x, x' \le r_m$, we
have $(\max\{x, r_m\} - \max\{x', r_m\}) = 0$; to obtain $\dagger$ note that $(\max\{x, r_i\} - \max\{x', r_i\}) \le (x - x')$ for any $r_i$. Thus,
$\beta^{(1)}(x)$, $x \in (-\infty, r_m]$, is a contraction mapping (recall (42)) with $\kappa = (1 - p_m^{(1)}) < 1$ (since $p_m^{(1)} > 0$ from the definition of $m$). Hence
from the Banach fixed point theorem [38] it follows that there exists a unique fixed point $\alpha^* \in (-\infty, r_m]$, i.e., $\alpha^*$ satisfies

$$\alpha^* = \beta^{(1)}(\alpha^*).$$

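Now, suppose we can show that

$$J^{(1)}(r_i, \mathbf{t}) = \min\left\{ -\eta_1 r_i, \; -\eta_1 \alpha^* \right\}, \qquad (43)$$

where $J^{(1)}(r_i, \mathbf{t})$ is the optimal cost to $\mathscr{F}_1$ from the state $(r_i, \mathbf{t})$; then, recalling the expression for $D^{(1)}$ from (11), we obtain

$$-\frac{D^{(1)}}{\eta_1} = -\frac{1}{\eta_1}\left( \tau + \sum_{i} p_i^{(1)} J^{(1)}(r_i, \mathbf{t}) \right) = E\left[\max\{\alpha^*, R_1\}\right] - \frac{\tau}{\eta_1} = \beta^{(1)}(\alpha^*) = \alpha^*.$$

Thus, $\alpha^{(1)} = -D^{(1)}/\eta_1$ is the unique fixed point of $\beta^{(1)}(\cdot)$.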

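To show (43), we proceed as follows. Let $J_0(r_i) = 0$ for all $r_i$, and for $k \ge 1$ define $J_k(r_i)$ inductively as

$$J_k(r_i) = \min\left\{ -\eta_1 r_i, \; \tau + E\left[J_{k-1}(R_1)\right] \right\}. \qquad (44)$$

Since our problem with one player is equivalent to the optimal stopping problem studied in [39], the above iterations converge
to the optimal cost, i.e., $\lim_{k \to \infty} J_k(r_i) = J^{(1)}(r_i, \mathbf{t})$. Now, defining $\alpha_1 = -\tau/\eta_1$, $J_1(r_i)$ can be written as $J_1(r_i) =
\min\{-\eta_1 r_i, -\eta_1 \alpha_1\}$. Proceeding further we can write

$$\begin{aligned}
J_2(r_i) &= \min\left\{ -\eta_1 r_i, \; \tau + E\left[J_1(R_1)\right] \right\} \\
&= \min\left\{ -\eta_1 r_i, \; \tau + E\left[\min\{-\eta_1 \alpha_1, -\eta_1 R_1\}\right] \right\} \\
&= \min\left\{ -\eta_1 r_i, \; -\eta_1 \left( E\left[\max\{\alpha_1, R_1\}\right] - \frac{\tau}{\eta_1} \right) \right\} \\
&= \min\left\{ -\eta_1 r_i, \; -\eta_1 \alpha_2 \right\},
\end{aligned}$$

where $\alpha_2 = \beta^{(1)}(\alpha_1)$. Similarly it can be shown that, if $J_{k-1}(r_i) = \min\{-\eta_1 r_i, -\eta_1 \alpha_{k-1}\}$, then

$$J_k(r_i) = \min\left\{ -\eta_1 r_i, \; -\eta_1 \alpha_k \right\}, \qquad (45)$$

where $\alpha_k = \beta^{(1)}(\alpha_{k-1})$. Thus $\alpha_k \to \alpha^*$. Finally, taking the limit as $k \to \infty$ on both sides of the above expression, and using
$J_k(r_i) \to J^{(1)}(r_i, \mathbf{t})$ and $\alpha_k \to \alpha^*$, we obtain the desired result. ■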
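
Because $\beta^{(1)}$ is a contraction, the threshold $\alpha^*$ can be computed numerically by simply iterating the map. The Python sketch below does this starting from $\alpha_1 = -\tau/\eta_1$, mirroring the sequence $\alpha_k = \beta^{(1)}(\alpha_{k-1})$ used in the proof. The function names, the stopping tolerance, and the example reward values are assumptions made for illustration; they are not taken from the paper.

```python
def beta(x, rewards, pmf, tau, eta):
    """beta(x) = E[max(x, R)] - tau / eta for a finite reward p.m.f."""
    return sum(p * max(x, r) for r, p in zip(rewards, pmf)) - tau / eta


def solve_threshold(rewards, pmf, tau, eta, tol=1e-12, max_iter=10_000):
    """Iterate alpha_k = beta(alpha_{k-1}) from alpha_1 = -tau / eta
    until successive iterates differ by less than tol."""
    alpha = -tau / eta
    for _ in range(max_iter):
        alpha_next = beta(alpha, rewards, pmf, tau, eta)
        if abs(alpha_next - alpha) < tol:
            return alpha_next
        alpha = alpha_next
    raise RuntimeError("fixed-point iteration did not converge")


if __name__ == "__main__":
    # Hypothetical two-point reward distribution: the fixed point solves
    # x = 0.5 * x + 0.5 * 1 - 0.25 on [0, 1], i.e. x = 0.5.
    print(solve_threshold([0.0, 1.0], [0.5, 0.5], tau=0.25, eta=1.0))
```

In this example the contraction factor is $1 - p_m^{(1)} = 0.5$, so the error halves at every iteration.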


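Appendix B
Proof of Theorem 1

Theorem 1: Given a policy pair $(\pi_1^*, \pi_2^*)$, construct the static game given in Table 5. Then the following statements are
equivalent:

(a) $(\pi_1^*, \pi_2^*)$ is an NEPP.

(b) For any $x = (r_i, r_j)$, $(\pi_1^*(x), \pi_2^*(x))$ is a Nash equilibrium (NE) strategy for the game in Table 5. Further, the expected
cost pair at this NE strategy is $(J^{(1)}_{\pi_1^*,\pi_2^*}(x), J^{(2)}_{\pi_1^*,\pi_2^*}(x))$.

          |  c                                                      |  s
    ------+---------------------------------------------------------+------------------------------
      c   |  $C^{(1)}_{\pi_1^*,\pi_2^*}$, $C^{(2)}_{\pi_1^*,\pi_2^*}$  |  $D^{(1)}$, $-\eta_2 r_j$
      s   |  $-\eta_1 r_i$, $D^{(2)}$                               |  $E^{(1)}(r_i)$, $E^{(2)}(r_j)$

TABLE 5
Static stage game (rows: action of $\mathscr{F}_1$; columns: action of $\mathscr{F}_2$; each cell lists the cost to $\mathscr{F}_1$ followed by the cost to $\mathscr{F}_2$).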


Proof of (a) ⇒ (b): Suppose (a) is true, i.e., $(\pi_1^*, \pi_2^*)$ is an NEPP. Then, $\pi_1^*$ is the best response policy of $\mathscr{F}_1$ against
the policy $\pi_2^*$ of $\mathscr{F}_2$. Hence $\pi_1^*$ is optimal for the MDP problem, denoted $MDP_1(\pi_2^*)$, which is obtained by fixing the policy
$\pi_2^*$ of $\mathscr{F}_2$ (note that $MDP_1(\pi_2^*)$ is a time homogeneous MDP since $\pi_2^*$ is stationary; recall Definition 1). Since (1) the states
of the form $(\mathbf{t}, r_j)$ are absorbing and cost free for $\mathscr{F}_1$, and (2) the policy of $\mathscr{F}_1$ which never stops incurs infinite cost to $\mathscr{F}_1$, it
follows that $MDP_1(\pi_2^*)$ is an optimal stopping problem [39]. Hence, for $x = (r_i, r_j)$, the cost $J^{(1)}_{\pi_1^*,\pi_2^*}(x)$ satisfies the following Bellman
equation:

$$\begin{aligned}
J^{(1)}_{\pi_1^*,\pi_2^*}(x) &= \min\left\{ C_s(x), \; C_c(x) \right\} \\
&= \min\left\{ \pi_2^*(x, c)(-\eta_1 r_i) + \pi_2^*(x, s) E^{(1)}(r_i), \;\; \pi_2^*(x, c) C^{(1)}_{\pi_1^*,\pi_2^*} + \pi_2^*(x, s) D^{(1)} \right\}, \qquad (46)
\end{aligned}$$

where $\pi_2^*(x, c)$ (resp. $\pi_2^*(x, s)$) is the probability that $\mathscr{F}_2$ will choose action C (resp. S) when the state is $x$. The two terms
in the min-expression above (denoted $C_s(x)$ and $C_c(x)$) are the expected costs to $\mathscr{F}_1$ for taking actions S and C, respectively.
Note that these costs are exactly the expected costs incurred by $\mathscr{F}_1$ for playing actions S and C, respectively, in the static game
in Table 4 when the strategy of $\mathscr{F}_2$ is $\pi_2^*(x)$. Now $\pi_1^*$, being optimal for $MDP_1(\pi_2^*)$, chooses the action $\pi_1^*(x) \in \Delta(\{s, c\})$
which gives the minimum cost, or can randomize between the two if both the costs are equal. Hence, it follows from the
structure of (46) that $\pi_1^*(x)$ is the best response against $\pi_2^*(x)$ for the game in Table 4. Further, the cost to $\mathscr{F}_1$ for playing
$\pi_1^*(x)$, from Table 4, is $\min\{C_s(x), C_c(x)\} = J^{(1)}_{\pi_1^*,\pi_2^*}(x)$.

Similarly, by writing the Bellman equation corresponding to the $MDP_2(\pi_1^*)$ problem (which is obtained by fixing the
policy $\pi_1^*$ of $\mathscr{F}_1$), we can conclude that $\pi_2^*(x)$ is the best response against $\pi_1^*(x)$ for the game in Table 4 with the cost to
player $\mathscr{F}_2$ being $J^{(2)}_{\pi_1^*,\pi_2^*}(x)$.
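Proof of (b) ⇒ (a): Given that the policy pair $(\pi_1^*, \pi_2^*)$ satisfies the condition in (b), let $\pi_1$ be any policy of $\mathscr{F}_1$. Then, for any
$x = (r_i, r_j)$, since $(\pi_1^*(x), \pi_2^*(x))$ is a NE strategy for the game in Table 4 with the cost to $\mathscr{F}_1$ at equilibrium being $J^{(1)}_{\pi_1^*,\pi_2^*}(x)$,
we can write

$$J^{(1)}_{\pi_1^*,\pi_2^*}(x) \le \pi_1(x, c)\left( \pi_2^*(x, c) C^{(1)}_{\pi_1^*,\pi_2^*} + \pi_2^*(x, s) D^{(1)} \right) + \pi_1(x, s)\left( \pi_2^*(x, c)(-\eta_1 r_i) + \pi_2^*(x, s) E^{(1)}(r_i) \right).$$

The LHS of the above expression is the cost incurred by $\mathscr{F}_1$ when the strategy played is $(\pi_1^*(x), \pi_2^*(x))$.

Substituting for $D^{(1)}$, $E^{(1)}(r_i)$ and $C^{(1)}_{\pi_1^*,\pi_2^*}$ (from (11), (16) and (18), respectively) in the above expression and then
rearranging, we can write

$$J^{(1)}_{\pi_1^*,\pi_2^*}(x) \le E^{x}_{\pi_1,\pi_2^*}\left[ g_1\left(X_1, (A_{1,1}, A_{2,1})\right) \right] + E^{x}_{\pi_1,\pi_2^*}\left[ J^{(1)}_{\pi_1^*,\pi_2^*}(X_2) \right].$$

Observe that $J^{(1)}_{\pi_1^*,\pi_2^*}(\cdot)$ appears on the RHS of the above expression. Hence, inductively applying the above inequality $K$
times, we obtain

$$J^{(1)}_{\pi_1^*,\pi_2^*}(x) \le E^{x}_{\pi_1,\pi_2^*}\left[ \sum_{k=1}^{K} g_1\left(X_k, (A_{1,k}, A_{2,k})\right) \right] + E^{x}_{\pi_1,\pi_2^*}\left[ J^{(1)}_{\pi_1^*,\pi_2^*}(X_{K+1}) \right].$$

Taking the limit as $K \to \infty$ in the above expression we obtain

$$J^{(1)}_{\pi_1^*,\pi_2^*}(x) \le J^{(1)}_{\pi_1,\pi_2^*}(x) + \lim_{K \to \infty} E^{x}_{\pi_1,\pi_2^*}\left[ J^{(1)}_{\pi_1^*,\pi_2^*}(X_{K+1}) \right]. \qquad (47)$$

Now, let $A_t = \{(\mathbf{t}, \mathbf{t})\} \cup \{(\mathbf{t}, r_j) : j \in [n]\}$. $A_t$ is the set of all states which are entered once $\mathscr{F}_1$ terminates. We will assume that
the policy pair $(\pi_1, \pi_2^*)$ is such that $\mathscr{F}_1$ will eventually terminate starting from any state $x$, i.e., $\lim_{K \to \infty} P^{x}_{\pi_1,\pi_2^*}(X_K \in A_t) = 1$,
or equivalently, for any $x' \notin A_t$, $\lim_{K \to \infty} P^{x}_{\pi_1,\pi_2^*}(X_K = x') = 0$ (otherwise, with positive probability $\mathscr{F}_1$ will continue forever,
incurring a delay cost of $\tau$ at every stage and yielding $J^{(1)}_{\pi_1,\pi_2^*}(x) = \infty$, so that the inequality $J^{(1)}_{\pi_1^*,\pi_2^*}(x) \le J^{(1)}_{\pi_1,\pi_2^*}(x)$ trivially
holds). Using this along with $J^{(1)}_{\pi_1^*,\pi_2^*}(x^\circ) = 0$ for any $x^\circ \in A_t$, we can write

$$\begin{aligned}
\lim_{K \to \infty} E^{x}_{\pi_1,\pi_2^*}\left[ J^{(1)}_{\pi_1^*,\pi_2^*}(X_{K+1}) \right]
&= \lim_{K \to \infty} \left( \sum_{x^\circ \in A_t} P^{x}_{\pi_1,\pi_2^*}(X_{K+1} = x^\circ) J^{(1)}_{\pi_1^*,\pi_2^*}(x^\circ) + \sum_{x' \notin A_t} P^{x}_{\pi_1,\pi_2^*}(X_{K+1} = x') J^{(1)}_{\pi_1^*,\pi_2^*}(x') \right) \\
&\overset{*}{=} \sum_{x' \notin A_t} \lim_{K \to \infty} P^{x}_{\pi_1,\pi_2^*}(X_{K+1} = x') J^{(1)}_{\pi_1^*,\pi_2^*}(x') \\
&\overset{\circ}{=} \sum_{x' \notin A_t} J^{(1)}_{\pi_1^*,\pi_2^*}(x') \lim_{K \to \infty} P^{x}_{\pi_1,\pi_2^*}(X_{K+1} = x') \\
&= 0.
\end{aligned}$$

Note that, in $*$, interchanging the limit and summation was possible because we have a finite sum (since our state space is finite).
Also, since we have restricted ourselves to the class of stationary policies (recall Definition 1), $J^{(1)}_{\pi_1^*,\pi_2^*}(x')$ is not a function of
the stage index $K$, which enables us to proceed to $\circ$. Finally, using the above in (47) we obtain $J^{(1)}_{\pi_1^*,\pi_2^*}(x) \le J^{(1)}_{\pi_1,\pi_2^*}(x)$.

Similarly, for $\mathscr{F}_2$ it can be shown that $J^{(2)}_{\pi_1^*,\pi_2^*}(x) \le J^{(2)}_{\pi_1^*,\pi_2}(x)$ for any $\pi_2$ and $x = (r_i, r_j)$. ■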


Appendix C
Proof of Lemma 2

Lemma 2 will be an immediate consequence of the following result.
Lemma 7: Given an NEPP $(\pi_1^*, \pi_2^*)$, for any $(r_i, r_j) \in \mathcal{X}$ we have

$$J^{(1)}_{\pi_1^*,\pi_2^*}(r_i, \mathbf{t}) \le J^{(1)}_{\pi_1^*,\pi_2^*}(r_i, r_j), \qquad (48)$$

$$J^{(2)}_{\pi_1^*,\pi_2^*}(\mathbf{t}, r_j) \le J^{(2)}_{\pi_1^*,\pi_2^*}(r_i, r_j). \qquad (49)$$

Proof: We will prove only (48); the proof of (49) is along similar lines. Since $(\pi_1^*, \pi_2^*)$ is an NEPP, it follows that the
policy $\pi_1^*$ is the best response for $\mathscr{F}_1$ against the policy $\pi_2^*$ of $\mathscr{F}_2$, i.e., for any $x \in \mathcal{X}$, $J^{(1)}_{\pi_1^*,\pi_2^*}(x) = \inf_{\pi_1} J^{(1)}_{\pi_1,\pi_2^*}(x)$. Thus
$J^{(1)}_{\pi_1^*,\pi_2^*}(x)$ can be regarded as the optimal cost of the MDP problem, $MDP_1(\pi_2^*)$, obtained by fixing the policy $\pi_2^*$ of $\mathscr{F}_2$.
For simplicity of notation we will denote $J^{(1)}_{\pi_1^*,\pi_2^*}(x)$ as $H^*(x)$. Thus for the states of the form $(r_i, \mathbf{t})$, $H^*(r_i, \mathbf{t})$ satisfies the
following Bellman equation (this expression is the same as the one in (10)):

$$H^*(r_i, \mathbf{t}) = \min\left\{ -\eta_1 r_i, \; C_c(r_i, \mathbf{t}) \right\}, \qquad (50)$$

where

$$C_c(r_i, \mathbf{t}) = \tau + \sum_{i'} p_{i'}^{(1)} H^*(r_{i'}, \mathbf{t}) \qquad (51)$$

is the expected cost of continuing, and $-\eta_1 r_i$ is the cost of stopping.

However, for states of the form $(r_i, r_j)$ (where $\mathscr{F}_2$ is also competing for a relay), the optimality equation is more involved
since the actions of $\mathscr{F}_2$ will now affect both costs (stopping and continuing) of $\mathscr{F}_1$. Defining $\epsilon = \pi_2^*((r_i, r_j), s)$ ($\epsilon$ is the
probability with which $\mathscr{F}_2$ will stop when the state is $(r_i, r_j)$), the Bellman equation for states of the form $(r_i, r_j)$ can be written
as

$$H^*(r_i, r_j) = \min\left\{ C_s(r_i, r_j), \; C_c(r_i, r_j) \right\}, \qquad (52)$$

where $C_s(r_i, r_j)$ is the expected cost incurred by $\mathscr{F}_1$ for stopping when the state is $(r_i, r_j)$, and $C_c(r_i, r_j)$ is the expected
cost of continuing.

The expression for $C_s(r_i, r_j)$ is (recall that $\nu_\rho$, $\rho = 1, 2$, is the probability that $\mathscr{F}_\rho$ gets the relay if both forwarders
simultaneously choose to stop)

$$C_s(r_i, r_j) = \epsilon \left( \nu_1 (-\eta_1 r_i) + \nu_2 C_c(r_i, \mathbf{t}) \right) + (1 - \epsilon)(-\eta_1 r_i). \qquad (53)$$

The first term in the RHS of the above expression is the expected stopping cost incurred by $\mathscr{F}_1$ conditioned on the event that
$\mathscr{F}_2$ also decides to stop. This can be understood as follows: suppose $\mathscr{F}_2$ also decides to stop (the probability of which is $\epsilon$); then
w.p. $\nu_1$, $\mathscr{F}_1$ gets the relay, incurring a termination cost of $-\eta_1 r_i$; otherwise $\mathscr{F}_2$ gets the relay, in which case $\mathscr{F}_1$ has to continue
alone, the expected cost of which is $C_c(r_i, \mathbf{t})$. The remaining term, $(1 - \epsilon)(-\eta_1 r_i)$, in (53) is the stopping cost incurred by $\mathscr{F}_1$
when the action of $\mathscr{F}_2$ is to continue (which happens with probability $(1 - \epsilon)$).

Similarly, the cost incurred by $\mathscr{F}_1$ for continuing, $C_c(r_i, r_j)$, can be written as

$$C_c(r_i, r_j) = \epsilon \left( \tau + \sum_{i'} p_{i'}^{(1)} H^*(r_{i'}, \mathbf{t}) \right) + (1 - \epsilon) \left( \tau + \sum_{i', j'} p_{i',j'} H^*(r_{i'}, r_{j'}) \right). \qquad (54)$$

Now, returning to (50) and (52), $H^*$ can be expressed as the fixed point of a mapping $T$ which is, for a function $H(\cdot, \cdot)$,
given by

$$TH(r_i, \mathbf{t}) = \min\left\{ -\eta_1 r_i, \; C_c^{H}(r_i, \mathbf{t}) \right\},$$

$$TH(r_i, r_j) = \min\left\{ C_s^{H}(r_i, r_j), \; C_c^{H}(r_i, r_j) \right\},$$

where the expressions for $C_c^{H}(r_i, \mathbf{t})$, $C_s^{H}(r_i, r_j)$ and $C_c^{H}(r_i, r_j)$ are similar to those of $C_c(r_i, \mathbf{t})$, $C_s(r_i, r_j)$ and $C_c(r_i, r_j)$ (in
(51), (53) and (54), respectively) with $H^*$ replaced by the given function $H$. Inductively define $H_k = T H_{k-1}$ with $H_0 = 0$
(i.e., $H_0(x) = 0$ for all $x \in \mathcal{X}$). Since $MDP_1(\pi_2^*)$ is an optimal stopping problem [39] it follows that $H_k \to H^*$ (this is the
value iteration algorithm). Hence, to complete the proof we will show that $H_k(r_i, \mathbf{t}) \le H_k(r_i, r_j)$ whenever $H_{k-1}(r_i, \mathbf{t}) \le
H_{k-1}(r_i, r_j)$.

Suppose, for some $k \ge 1$, $H_{k-1}(r_i, \mathbf{t}) \le H_{k-1}(r_i, r_j)$ for all $(r_i, r_j) \in \mathcal{X}$ (this holds trivially for $k = 1$). First consider the
case where $-\eta_1 r_i \le C_c^{H_{k-1}}(r_i, \mathbf{t})$ (i.e., it is optimal to stop when the state is $(r_i, \mathbf{t})$).

• Then from (53) we obtain $-\eta_1 r_i \le C_s^{H_{k-1}}(r_i, r_j)$.

• Also, from the induction hypothesis we have

$$\sum_{i'} p_{i'}^{(1)} H_{k-1}(r_{i'}, \mathbf{t}) \le \sum_{i', j'} p_{i',j'} H_{k-1}(r_{i'}, r_{j'}).$$

Using the above in (54) and recalling (51) we can write

$$C_c^{H_{k-1}}(r_i, r_j) \ge \tau + \sum_{i'} p_{i'}^{(1)} H_{k-1}(r_{i'}, \mathbf{t}) = C_c^{H_{k-1}}(r_i, \mathbf{t}) \ge -\eta_1 r_i.$$

Thus we have

$$\begin{aligned}
H_k(r_i, \mathbf{t}) &= \min\left\{ -\eta_1 r_i, \; C_c^{H_{k-1}}(r_i, \mathbf{t}) \right\} \\
&= -\eta_1 r_i \\
&\le \min\left\{ C_s^{H_{k-1}}(r_i, r_j), \; C_c^{H_{k-1}}(r_i, r_j) \right\} \\
&= H_k(r_i, r_j).
\end{aligned}$$

Similarly for the other case, i.e., when $-\eta_1 r_i > C_c^{H_{k-1}}(r_i, \mathbf{t})$, we can show that both the costs, $C_s^{H_{k-1}}(r_i, r_j)$ and $C_c^{H_{k-1}}(r_i, r_j)$,
are no less than $C_c^{H_{k-1}}(r_i, \mathbf{t})$, again yielding $H_k(r_i, \mathbf{t}) \le H_k(r_i, r_j)$. ■

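We are now ready to prove Lemma 2.

Lemma 2: For an NEPP, $(\pi_1^*, \pi_2^*)$, the various costs are ordered as follows:

$$D^{(1)} \le C^{(1)}_{\pi_1^*,\pi_2^*} \quad \text{and} \quad D^{(2)} \le C^{(2)}_{\pi_1^*,\pi_2^*}.$$

Proof: Recalling the cost expressions of $D^{(1)}$ and $C^{(1)}_{\pi_1^*,\pi_2^*}$ (from (11) and (18), respectively) we can write

$$\begin{aligned}
D^{(1)} &= \tau + \sum_{i'} p_{i'}^{(1)} J^{(1)}_{\pi_1^*,\pi_2^*}(r_{i'}, \mathbf{t}) \\
&= \tau + \sum_{i', j'} p_{i',j'} J^{(1)}_{\pi_1^*,\pi_2^*}(r_{i'}, \mathbf{t}) \\
&\overset{*}{\le} \tau + \sum_{i', j'} p_{i',j'} J^{(1)}_{\pi_1^*,\pi_2^*}(r_{i'}, r_{j'}) \\
&= C^{(1)}_{\pi_1^*,\pi_2^*},
\end{aligned}$$

where $*$ is due to Lemma 7. Similarly, one can show that $D^{(2)} \le C^{(2)}_{\pi_1^*,\pi_2^*}$. ■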

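Appendix D

Obtaining NE Strategies for the Static Game in Table 4
For convenience, let us first recall the game in Table 4.

          |  c                                                      |  s
    ------+---------------------------------------------------------+------------------------------
      c   |  $C^{(1)}_{\pi_1^*,\pi_2^*}$, $C^{(2)}_{\pi_1^*,\pi_2^*}$  |  $D^{(1)}$, $-\eta_2 r_j$
      s   |  $-\eta_1 r_i$, $D^{(2)}$                               |  $E^{(1)}(r_i)$, $E^{(2)}(r_j)$

TABLE 6
Static stage game (rows: action of $\mathscr{F}_1$; columns: action of $\mathscr{F}_2$; each cell lists the cost to $\mathscr{F}_1$ followed by the cost to $\mathscr{F}_2$).

Since only two actions (namely s and c) are available to each forwarder, a strategy used by $\mathscr{F}_1$ can be conveniently
represented by $\sigma_1 \in [0,1]$, where $\sigma_1$ is the probability that $\mathscr{F}_1$ will choose action s. Similarly, $\sigma_2 \in [0,1]$ is the probability
that $\mathscr{F}_2$ will choose action s. Given a strategy pair $(\sigma_1, \sigma_2)$ the expected cost (obtained from Table 4) incurred by $\mathscr{F}_1$ can be
expressed as

$$U_1(\sigma_1, \sigma_2) = \sigma_1 A_{\sigma_2} + B_{\sigma_2}, \qquad (55)$$

where

$$A_{\sigma_2} = (1 - \sigma_2)\left( -\eta_1 r_i - C^{(1)} \right) + \sigma_2 \left( E^{(1)}(r_i) - D^{(1)} \right) \qquad (56)$$

and

$$B_{\sigma_2} = (1 - \sigma_2) C^{(1)} + \sigma_2 D^{(1)}.$$

Let $\sigma_1^*(\sigma_2)$ denote the set of all best responses of $\mathscr{F}_1$ to the strategy $\sigma_2$ of $\mathscr{F}_2$, i.e.,

$$\sigma_1^*(\sigma_2) = \arg\min_{\sigma_1 \in [0,1]} U_1(\sigma_1, \sigma_2). \qquad (57)$$

Since $U_1(\sigma_1, \sigma_2)$ is linear in $\sigma_1$ it follows that $\sigma_1^*(\sigma_2) = \{0\}$ whenever $A_{\sigma_2} > 0$, $\sigma_1^*(\sigma_2) = \{1\}$ whenever $A_{\sigma_2} < 0$, and
$\sigma_1^*(\sigma_2) = [0,1]$ whenever $A_{\sigma_2} = 0$. We make use of these observations in the proof of our next lemma. First, for convenience,
let us denote the thresholds $-C^{(1)}/\eta_1$ and $-C^{(2)}/\eta_2$ by $\zeta^{(1)}$ and $\zeta^{(2)}$, respectively. Recall that we already have $\alpha^{(1)} = -D^{(1)}/\eta_1$ and
$\alpha^{(2)} = -D^{(2)}/\eta_2$. The inequalities in (20) would (since $-\eta_\rho < 0$) imply that $\zeta^{(1)} < \alpha^{(1)}$ and $\zeta^{(2)} < \alpha^{(2)}$.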
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(T 2 ' 
[Figure 5: six panels, (a) to (f), each plotting $\sigma_2$ (vertical axis) against $\sigma_1$ (horizontal axis) over $[0,1] \times [0,1]$.]

Fig. 5. Plot of best response curves, $\sigma_1^*(\sigma_2)$ and $\sigma_2^*(\sigma_1)$, for $(r_i, r_j)$ in different regions. In each of these figures, the solid red curve is $\sigma_1^*(\sigma_2)$ and the
dashed blue curve is $\sigma_2^*(\sigma_1)$. (a) $(r_i, r_j) \in \mathcal{R}_1$, (b) $(r_i, r_j) \in \mathcal{R}_5$, (c) $(r_i, r_j) \in \mathcal{R}_4$, (d) $(r_i, r_j) \in \mathcal{R}_{2a}$, (e) $(r_i, r_j) \in \mathcal{R}_{2b}$, and (f) $(r_i, r_j) \in \mathcal{R}_{2c}$.

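Lemma 8: Suppose $\nu_1 \in (0,1)$ and $D^{(1)} < C^{(1)}$, then

1) If $r_i < \xi^{(1)}$ then $\sigma_1^*(\sigma_2) = \{0\}$ for all $\sigma_2 \in [0,1]$.

2) If $r_i > \alpha^{(1)}$ then $\sigma_1^*(\sigma_2) = \{1\}$ for all $\sigma_2 \in [0,1]$.

3) If $\xi^{(1)} < r_i < \alpha^{(1)}$ then defining

$\Gamma_2 = \dfrac{-\eta_1 r_i - C^{(1)}}{\big(-\eta_1 r_i - C^{(1)}\big) - \big(E^{(1)}(r_i) - D^{(1)}\big)}$    (58)

we have: (i) $\sigma_1^*(\sigma_2) = \{1\}$ for $\sigma_2 < \Gamma_2$, (ii) $\sigma_1^*(\sigma_2) = \{0\}$ for $\sigma_2 > \Gamma_2$, and (iii) $\sigma_1^*(\Gamma_2) = [0,1]$.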
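The three cases of Lemma 8 can be illustrated with a short numerical sketch. The helper names, the tolerance used to detect $A_{\sigma_2} = 0$, and the parameter values are hypothetical, chosen so that $\xi^{(1)} = 2$ and $\alpha^{(1)} = 3$; the sketch assumes $\nu_2 = 1 - \nu_1$.

```python
def a_coefficient(sigma2, r_i, eta1, nu1, C1, D1):
    """Coefficient A of sigma1 in forwarder 1's expected cost (assumes nu2 = 1 - nu1)."""
    stop_alone = -eta1 * r_i
    E1 = nu1 * stop_alone + (1.0 - nu1) * D1
    return (1 - sigma2) * (stop_alone - C1) + sigma2 * (E1 - D1)


def best_response_f1(sigma2, r_i, eta1, nu1, C1, D1, tol=1e-12):
    """Best-response set of forwarder 1, returned as an interval (lo, hi) of stop probabilities."""
    A = a_coefficient(sigma2, r_i, eta1, nu1, C1, D1)
    if A > tol:        # stopping is strictly costlier: continue
        return (0.0, 0.0)
    if A < -tol:       # stopping is strictly cheaper: stop
        return (1.0, 1.0)
    return (0.0, 1.0)  # indifferent: every mixture is a best response


def gamma2(r_i, eta1, nu1, C1, D1):
    """Opponent stop probability at which forwarder 1 is indifferent (Part 3 only)."""
    stop_alone = -eta1 * r_i
    E1 = nu1 * stop_alone + (1.0 - nu1) * D1
    a = stop_alone - C1
    b = E1 - D1
    return a / (a - b)
```

With $\eta_1 = 1$, $\nu_1 = 0.5$, $C^{(1)} = -2$ and $D^{(1)} = -3$, a reward of 2.5 lies strictly between the two thresholds, so the best response switches from stopping to continuing as $\sigma_2$ crosses $\Gamma_2$.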

Proof of Part 1: We will show that $A_{\sigma_2} > 0$ for any $\sigma_2 \in [0,1]$. Then the proof follows immediately since $U_1(\sigma_1, \sigma_2) =
\sigma_1 A_{\sigma_2} + B_{\sigma_2}$ is linear in $\sigma_1$.

Let us recall the expression for $A_{\sigma_2}$,

$A_{\sigma_2} = (1 - \sigma_2)\big(-\eta_1 r_i - C^{(1)}\big) + \sigma_2\big(E^{(1)}(r_i) - D^{(1)}\big)$,    (59)

where $E^{(1)}(r_i) = \nu_1(-\eta_1 r_i) + \nu_2 D^{(1)}$. It is already given that $r_i < \xi^{(1)}$, or

$\big(-\eta_1 r_i - C^{(1)}\big) > 0$.    (60)

Since $D^{(1)} < C^{(1)}$ we also have $-\eta_1 r_i > D^{(1)}$, which gives $E^{(1)}(r_i) > D^{(1)}$ (this is where $\nu_1 \in (0,1)$ is required), i.e.,
$\big(E^{(1)}(r_i) - D^{(1)}\big) > 0$. Using this along with inequality (60) we obtain the desired result.


(Proof of Part 2) Similar to the previous part, the proof follows once we show that $A_{\sigma_2} < 0$ for all $\sigma_2 \in [0,1]$. Since
$r_i > \alpha^{(1)}$ and $D^{(1)} < C^{(1)}$, we obtain $\big(-\eta_1 r_i - C^{(1)}\big) < 0$ and $\big(E^{(1)}(r_i) - D^{(1)}\big) < 0$. Using these in (59) we obtain $A_{\sigma_2} < 0$.

(Proof of Part 3) Again, since $U_1(\sigma_1, \sigma_2)$ is linear in $\sigma_1$, we have to show that $A_{\sigma_2} < 0$ whenever $\sigma_2 < \Gamma_2$, $A_{\sigma_2} > 0$
whenever $\sigma_2 > \Gamma_2$, and $A_{\Gamma_2} = 0$.

Suppose $\sigma_2 \in [0,1]$ is such that $\sigma_2 < \Gamma_2$ (thus $\Gamma_2 \in (0,1]$), then recalling the expression for $\Gamma_2$ we can write,

$\sigma_2 < \dfrac{-\eta_1 r_i - C^{(1)}}{\big(-\eta_1 r_i - C^{(1)}\big) - \big(E^{(1)}(r_i) - D^{(1)}\big)}$.    (61)

It is important to note that, since $\frac{C^{(1)}}{-\eta_1} < r_i < \frac{D^{(1)}}{-\eta_1}$ with $D^{(1)} < C^{(1)}$ and $\nu_1 \in (0,1)$, the denominator in the RHS of the
above expression is strictly negative. Thus, rearranging (61) we obtain $A_{\sigma_2} < 0$ so that $\sigma_1^*(\sigma_2) = \{1\}$.

Similarly, when $\sigma_2 \in [0,1]$ is such that $\sigma_2 > \Gamma_2$ (in which case $\Gamma_2 \in [0,1)$), then reversing the inequality in (61) and
rearranging we obtain $A_{\sigma_2} > 0$ so that $\sigma_1^*(\sigma_2) = \{0\}$.

Finally, substituting for $\Gamma_2$ in the expression for $A_{\sigma_2}$ will yield $A_{\Gamma_2} = 0$, implying that any $\sigma_1 \in [0,1]$ is a best response
against the strategy $\Gamma_2$ played by $\mathscr{F}_2$. Hence $\sigma_1^*(\Gamma_2) = [0,1]$. ■

Remark: The condition imposed on $\nu_1$ and $C^{(1)}$ in the above lemma is only to avoid the less interesting boundary cases.
Also, note that $\Gamma_2$ is a function of the reward $r_i$ to $\mathscr{F}_1$. For notational simplicity we do not show $r_i$ as an argument of $\Gamma_2$.

Similarly, for $\mathscr{F}_2$ we can define $\sigma_2^*(\sigma_1)$ as the set of all best responses against the strategy $\sigma_1$ played by $\mathscr{F}_1$, and obtain a
result analogous to that in Lemma 8 but with the quantities corresponding to $\mathscr{F}_1$ replaced by those corresponding to $\mathscr{F}_2$, e.g.,
$\xi^{(1)}$ replaced by $\xi^{(2)}$, $\alpha^{(1)}$ by $\alpha^{(2)}$, and $\Gamma_2$ by $\Gamma_1$ where

$\Gamma_1 = \dfrac{-\eta_2 r_j - C^{(2)}}{\big(-\eta_2 r_j - C^{(2)}\big) - \big(E^{(2)}(r_j) - D^{(2)}\big)}$.    (62)


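[Figure 6: the $(r_i, r_j)$ plane, divided by the thresholds $\xi^{(1)}$, $\alpha^{(1)}$ on the $r_i$ axis and $\xi^{(2)}$, $\alpha^{(2)}$ on the $r_j$ axis into the regions $\mathcal{R}_1$ with NE strategy (c,c); $\mathcal{R}_{2a}$, $\mathcal{R}_{2b}$, $\mathcal{R}_{2c}$ with (s,c); $\mathcal{R}_{3a}$, $\mathcal{R}_{3b}$, $\mathcal{R}_{3c}$ with (c,s); $\mathcal{R}_4$ with (s,c), (c,s) and $(\Gamma_1, \Gamma_2)$; and $\mathcal{R}_5$ with (s,s).]

Fig. 6. Illustration of the various regions along with the NE strategy corresponding to these regions.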


Now, for any $(r_i, r_j)$ the points of intersection between the best response curves $\sigma_1^*(\sigma_2)$ and $\sigma_2^*(\sigma_1)$ constitute the NE
strategies of the game in Table 4. For instance, as shown in Fig. 5(a), when $(r_i, r_j)$ is such that $r_i < \xi^{(1)}$ and $r_j < \xi^{(2)}$ (i.e.,
$(r_i, r_j) \in \mathcal{R}_1$; see Fig. 6) then the only point of intersection is (0, 0) so that (c,c) is the only NE strategy in this region.
Similarly, when $(r_i, r_j) \in \mathcal{R}_5$ then (s,s) is the only NE strategy (see Fig. 5(b)). An interesting case is when $(r_i, r_j) \in \mathcal{R}_4$
(see Fig. 5(c)) where there are multiple NE strategies, namely, (s,c), (c,s) and the mixed strategy $(\Gamma_1, \Gamma_2)$ (which depends
on the reward pair $(r_i, r_j)$; see the remark following Lemma 8).

The region $\mathcal{R}_2$ is written as a union of three disjoint regions, $\mathcal{R}_{2a}$, $\mathcal{R}_{2b}$ and $\mathcal{R}_{2c}$. However, as shown in Fig. 5(d) to 5(f), the
best response curves for $(r_i, r_j)$ in each of these sub-regions intersect at (1,0). Hence (s,c) is the NE strategy in the union
region $\mathcal{R}_2$. Similarly, (c,s) is the NE strategy in region $\mathcal{R}_3$, which is also composed of three sub-regions.


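$\mathcal{R}_1 = \{(r_i, r_j) : r_i < \xi^{(1)},\ r_j < \xi^{(2)}\}$

$\mathcal{R}_2 = \mathcal{R}_{2a} \cup \mathcal{R}_{2b} \cup \mathcal{R}_{2c}$ where
    $\mathcal{R}_{2a} = \{(r_i, r_j) : \xi^{(1)} < r_i < \alpha^{(1)},\ r_j < \xi^{(2)}\}$
    $\mathcal{R}_{2b} = \{(r_i, r_j) : r_i > \alpha^{(1)},\ r_j < \xi^{(2)}\}$
    $\mathcal{R}_{2c} = \{(r_i, r_j) : r_i > \alpha^{(1)},\ \xi^{(2)} < r_j < \alpha^{(2)}\}$

$\mathcal{R}_3 = \mathcal{R}_{3a} \cup \mathcal{R}_{3b} \cup \mathcal{R}_{3c}$ where
    $\mathcal{R}_{3a} = \{(r_i, r_j) : r_i < \xi^{(1)},\ \xi^{(2)} < r_j < \alpha^{(2)}\}$
    $\mathcal{R}_{3b} = \{(r_i, r_j) : r_i < \xi^{(1)},\ r_j > \alpha^{(2)}\}$
    $\mathcal{R}_{3c} = \{(r_i, r_j) : \xi^{(1)} < r_i < \alpha^{(1)},\ r_j > \alpha^{(2)}\}$

$\mathcal{R}_4 = \{(r_i, r_j) : \xi^{(1)} < r_i < \alpha^{(1)},\ \xi^{(2)} < r_j < \alpha^{(2)}\}$

$\mathcal{R}_5 = \{(r_i, r_j) : r_i > \alpha^{(1)},\ r_j > \alpha^{(2)}\}$


TABLE 7

Formal definition of various regions depicted in Fig. 6.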


We have thus identified a partition of the set $\{(r_i, r_j) : i, j \in [n]\}$ into five regions such that the sets of NE strategies
corresponding to the regions are all different. These regions, along with the corresponding NE strategies, are depicted in Fig. 6. A
formal definition of the various regions is available in Table 7. Note that these regions depend on the cost pair $C = (C^{(1)}, C^{(2)})$;
for simplicity we have not shown this explicitly in Fig. 6 and in Table 7.
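The region lookup described above amounts to locating each reward relative to its two thresholds. The following is a minimal sketch under stated assumptions: the names `band`, `REGIONS` and `classify` are hypothetical, the thresholds are passed in as pairs `xi` and `alpha`, and a reward that coincides exactly with a threshold is placed in the upper band, which is an arbitrary tie-breaking choice rather than something taken from the text.

```python
def band(r, xi, alpha):
    """Locate a reward relative to the two thresholds xi <= alpha."""
    if r < xi:
        return "low"
    if r < alpha:
        return "mid"
    return "high"  # boundary values fall in the upper band (arbitrary choice)


# (band of r_i, band of r_j) -> (region name, list of NE strategy pairs)
REGIONS = {
    ("low", "low"): ("R1", [("c", "c")]),
    ("mid", "low"): ("R2a", [("s", "c")]),
    ("high", "low"): ("R2b", [("s", "c")]),
    ("high", "mid"): ("R2c", [("s", "c")]),
    ("low", "mid"): ("R3a", [("c", "s")]),
    ("low", "high"): ("R3b", [("c", "s")]),
    ("mid", "high"): ("R3c", [("c", "s")]),
    # In R4 the third equilibrium is the mixed pair of stop probabilities.
    ("mid", "mid"): ("R4", [("s", "c"), ("c", "s"), ("Gamma1", "Gamma2")]),
    ("high", "high"): ("R5", [("s", "s")]),
}


def classify(r_i, r_j, xi, alpha):
    """Return (region, NE strategies) for a reward pair; xi and alpha are 2-tuples."""
    key = (band(r_i, xi[0], alpha[0]), band(r_j, xi[1], alpha[1]))
    return REGIONS[key]
```

For example, with `xi = (2, 2)` and `alpha = (3, 3)`, the call `classify(2.5, 2.5, xi, alpha)` returns region `"R4"` with its three NE strategies.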


Appendix E 
Proof of Theorem 3


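Theorem 3: Given a PO policy pair $(\pi_1^*, \pi_2^*)$, construct a strategy vector pair $\{(f_\ell^*, g_\ell^*) : \ell \in \mathcal{L}\}$ as follows: $f_\ell^*(r_i) = \pi_1^*(r_i, \ell)$
and $g_\ell^*(r_j) = \pi_2^*(\ell, r_j)$ for all $i, j \in [n]$. Now, suppose for each $\ell$, $(f_\ell^*, g_\ell^*)$ is a NE vector for $\mathcal{G}(\pi_1^*, \pi_2^*)$ such that,

$\min\big\{C^{(1)}_{s,g_\ell^*}(r_i),\ C^{(1)}_{c,g_\ell^*}\big\} = G^{(1)}_{\pi_1^*,\pi_2^*}(r_i, \ell)$, and

$\min\big\{C^{(2)}_{s,f_\ell^*}(r_j),\ C^{(2)}_{c,f_\ell^*}\big\} = G^{(2)}_{\pi_1^*,\pi_2^*}(\ell, r_j)$.

Then $(\pi_1^*, \pi_2^*)$ is a PO-NEPP.

Proof: Given the policy pair $(\pi_1^*, \pi_2^*)$ as in the hypothesis, let $\pi_1$ be any PO policy. We will show that $G^{(1)}_{\pi_1^*,\pi_2^*}(r_i, \ell) \le
G^{(1)}_{\pi_1,\pi_2^*}(r_i, \ell)$; the proof of $G^{(2)}_{\pi_1^*,\pi_2^*}(\ell, r_j) \le G^{(2)}_{\pi_1^*,\pi_2}(\ell, r_j)$, for any $\pi_2$, is along similar lines.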

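Since $f_\ell^* = BR_1(g_\ell^*)$, for the Bayesian game $\mathcal{G}(\pi_1^*, \pi_2^*)$, the expected cost incurred by $\mathscr{F}_1$ when its observation is $(r_i, \ell)$ is
$\min\big\{C^{(1)}_{s,g_\ell^*}(r_i),\ C^{(1)}_{c,g_\ell^*}\big\}$. Hence, using (37), $G^{(1)}_{\pi_1^*,\pi_2^*}(r_i, \ell)$ is no larger than the cost incurred when $\mathscr{F}_1$ takes the action
prescribed by $\pi_1$ in the first step and the pair $(\pi_1^*, \pi_2^*)$ is used thereafter.

Applying the above inequality $K$ times we obtain

$G^{(1)}_{\pi_1^*,\pi_2^*}(r_i, \ell) \le E^{(r_i,\ell)}_{\pi_1,\pi_2^*}\Big[\sum_{k=1}^{K} g_1\big(O_{1,k}, (a_{1,k}, a_{2,k})\big)\Big] + E^{(r_i,\ell)}_{\pi_1,\pi_2^*}\Big[G^{(1)}_{\pi_1^*,\pi_2^*}(O_{1,K+1})\Big]$.    (63)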


Again, as in the proof of Theorem 1 Part-(b), we will assume that the PO policy pair $(\pi_1, \pi_2^*)$ is such that using this policy
pair $\mathscr{F}_1$ will eventually terminate starting from any observation $O_1$, i.e.,

$\lim_{K \to \infty} P^{O_1}_{\pi_1,\pi_2^*}(O_{1,K} = t) = 1$.    (64)

Hence we have

$\lim_{K \to \infty} E^{(r_i,\ell)}_{\pi_1,\pi_2^*}\Big[G^{(1)}_{\pi_1^*,\pi_2^*}(O_{1,K+1})\Big] = 0$.

Using the above and recalling (28) while taking $\lim_{K \to \infty}$ in (63) we obtain $G^{(1)}_{\pi_1^*,\pi_2^*}(r_i, \ell) \le G^{(1)}_{\pi_1,\pi_2^*}(r_i, \ell)$.

Finally, suppose the PO policy pair $(\pi_1, \pi_2^*)$ does not satisfy (64); then there is a positive probability that $\mathscr{F}_1$ will continue
forever, yielding $G^{(1)}_{\pi_1,\pi_2^*}(r_i, \ell) = \infty$. Thus for this case, $G^{(1)}_{\pi_1^*,\pi_2^*}(r_i, \ell) \le G^{(1)}_{\pi_1,\pi_2^*}(r_i, \ell)$ trivially holds. ■

Appendix F 
Proof of Lemma 5

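Lemma 5: (1) Let $\Psi_\ell, \Psi'_\ell \in \mathcal{A}_0$ be two thresholds of $\mathscr{F}_2$ such that $\Psi_\ell < \Psi'_\ell$; then the best responses of $\mathscr{F}_1$ to these
are ordered as $BR_1(\Psi_\ell) \ge BR_1(\Psi'_\ell)$. (2) Similarly, if $\Phi_\ell, \Phi'_\ell \in \mathcal{A}_0$ are two thresholds of $\mathscr{F}_1$ such that $\Phi_\ell < \Phi'_\ell$ then
$BR_2(\Phi_\ell) \ge BR_2(\Phi'_\ell)$.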

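Proof: For convenience, first let us recall the expressions of the costs $C^{(1)}_{s,g_\ell}(r_i)$ and $C^{(1)}_{c,g_\ell}$ from (35) and (36) (since the
given PO policy pair is $(\pi_1^*, \pi_2^*)$, these costs correspond to the Bayesian game $\mathcal{G}(\pi_1^*, \pi_2^*)$):

$C^{(1)}_{s,g_\ell}(r_i) = \bar g_\ell(-\eta_1 r_i) + (1 - \bar g_\ell) E^{(1)}(r_i)$    (65)

$C^{(1)}_{c,g_\ell} = \bar g_\ell\, C^{(1)}_{\pi_1^*,\pi_2^*} + (1 - \bar g_\ell) D^{(1)}$.    (66)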

Also, recall from (33) that the cost of continuing alone is no more than the cost of continuing along with the competing forwarder,
i.e.,

$D^{(1)} \le C^{(1)}_{\pi_1^*,\pi_2^*}$.    (67)


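We will only prove Part-(1); the proof of Part-(2) is similar. Let $g_\ell$ and $g'_\ell$ be the threshold vectors of $\mathscr{F}_2$ whose corresponding
thresholds are $\Psi_\ell$ and $\Psi'_\ell$, respectively. Given that $\Psi_\ell < \Psi'_\ell$, to prove $BR_1(\Psi_\ell) \ge BR_1(\Psi'_\ell)$ it is sufficient to show that, for
any $r_i$, $C^{(1)}_{s,g_\ell}(r_i) < C^{(1)}_{c,g_\ell}$ implies $C^{(1)}_{s,g'_\ell}(r_i) < C^{(1)}_{c,g'_\ell}$.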

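Let us begin with an $r_i$ such that $C^{(1)}_{s,g_\ell}(r_i) < C^{(1)}_{c,g_\ell}$, or alternatively (recall (65) and (66)), $r_i$ is such that,

$\bar g_\ell(-\eta_1 r_i) + (1 - \bar g_\ell) E^{(1)}(r_i) < \bar g_\ell\, C^{(1)}_{\pi_1^*,\pi_2^*} + (1 - \bar g_\ell) D^{(1)}$.

Substituting $E^{(1)}(r_i) = \nu_1(-\eta_1 r_i) + \nu_2 D^{(1)}$ in the above expression, and then simplifying, we obtain,

$-\eta_1 r_i < \dfrac{\bar g_\ell\, C^{(1)}_{\pi_1^*,\pi_2^*} + (1 - \bar g_\ell)\nu_1 D^{(1)}}{\bar g_\ell + (1 - \bar g_\ell)\nu_1}$.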
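The simplification can be written out step by step. The following is a sketch of the algebra, assuming $\nu_2 = 1 - \nu_1$, writing $\bar g_\ell$ for the weight that multiplies $-\eta_1 r_i$ in (65), and assuming the factor $\bar g_\ell + (1-\bar g_\ell)\nu_1$ is strictly positive so that dividing by it preserves the inequality:

```latex
\begin{align*}
\bar g_\ell(-\eta_1 r_i) + (1-\bar g_\ell)\bigl[\nu_1(-\eta_1 r_i) + \nu_2 D^{(1)}\bigr]
  &< \bar g_\ell\, C^{(1)}_{\pi_1^*,\pi_2^*} + (1-\bar g_\ell)\, D^{(1)} \\
\bigl[\bar g_\ell + (1-\bar g_\ell)\nu_1\bigr](-\eta_1 r_i)
  &< \bar g_\ell\, C^{(1)}_{\pi_1^*,\pi_2^*} + (1-\bar g_\ell)(1-\nu_2)\, D^{(1)} \\
-\eta_1 r_i
  &< \frac{\bar g_\ell\, C^{(1)}_{\pi_1^*,\pi_2^*} + (1-\bar g_\ell)\nu_1\, D^{(1)}}
          {\bar g_\ell + (1-\bar g_\ell)\nu_1}
\end{align*}
```

The weights $\bar g_\ell$ and $(1-\bar g_\ell)\nu_1$ in the numerator are exactly the two terms of the denominator, which is why the right-hand side is a convex combination of the two costs.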

Thus $-\eta_1 r_i$, being less than a convex combination of the costs $C^{(1)}_{\pi_1^*,\pi_2^*}$ and $D^{(1)}$, is less than the larger of the two. Further, since
$D^{(1)} \le C^{(1)}_{\pi_1^*,\pi_2^*}$, the larger cost is $C^{(1)}_{\pi_1^*,\pi_2^*}$, so only two cases are possible: $-\eta_1 r_i < D^{(1)} \le C^{(1)}_{\pi_1^*,\pi_2^*}$, and $D^{(1)} \le -\eta_1 r_i < C^{(1)}_{\pi_1^*,\pi_2^*}$. We will
consider these two cases separately below.

Case-1: Suppose $-\eta_1 r_i < D^{(1)} \le C^{(1)}_{\pi_1^*,\pi_2^*}$, then

$E^{(1)}(r_i) = \nu_1(-\eta_1 r_i) + \nu_2 D^{(1)} < D^{(1)}$.

Using the above two inequalities in the expression of $C^{(1)}_{s,g'_\ell}(r_i)$, and then comparing with $C^{(1)}_{c,g'_\ell}$, we obtain $C^{(1)}_{s,g'_\ell}(r_i) < C^{(1)}_{c,g'_\ell}$.


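Case-2: Suppose $D^{(1)} \le -\eta_1 r_i < C^{(1)}_{\pi_1^*,\pi_2^*}$. Then we have $E^{(1)}(r_i) \ge D^{(1)}$. Define $\kappa(p)$ for $p \in [0,1]$ as,

$\kappa(p) = p\big(-\eta_1 r_i - E^{(1)}(r_i) - C^{(1)}_{\pi_1^*,\pi_2^*} + D^{(1)}\big) + \big(E^{(1)}(r_i) - D^{(1)}\big)$.    (68)

Since $-\eta_1 r_i < C^{(1)}_{\pi_1^*,\pi_2^*}$ and $E^{(1)}(r_i) \ge D^{(1)}$, $\kappa(p)$ is decreasing in $p$. Hence we can write $\kappa(\bar g'_\ell) \le \kappa(\bar g_\ell)$ because,
with $\Psi_\ell < \Psi'_\ell$, we have $\bar g_\ell \le \bar g'_\ell$.

Finally, rearranging the terms in (68) one can obtain,

$C^{(1)}_{s,g'_\ell}(r_i) - C^{(1)}_{c,g'_\ell} = \kappa(\bar g'_\ell) \le \kappa(\bar g_\ell) \overset{*}{<} 0$,    (69)

where $*$ is because we started with an $r_i$ such that $C^{(1)}_{s,g_\ell}(r_i) < C^{(1)}_{c,g_\ell}$.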
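The identity behind (68) and (69), namely that $\kappa$ evaluated at the opponent's weight equals the stop-minus-continue cost difference, can be checked numerically. This is a minimal sketch with hypothetical function names and parameter values; it assumes $\nu_2 = 1 - \nu_1$ and treats the weight `g` as the coefficient of $-\eta_1 r_i$ in (65).

```python
def kappa(p, r_i, eta1, nu1, C1, D1):
    """kappa(p) as in (68); C1 is the cost of continuing together, D1 of continuing alone."""
    stop_alone = -eta1 * r_i
    E1 = nu1 * stop_alone + (1.0 - nu1) * D1  # assumes nu2 = 1 - nu1
    return p * (stop_alone - E1 - C1 + D1) + (E1 - D1)


def stop_minus_continue(g, r_i, eta1, nu1, C1, D1):
    """Difference between the stopping cost (65) and the continuing cost (66)."""
    stop_alone = -eta1 * r_i
    E1 = nu1 * stop_alone + (1.0 - nu1) * D1
    cost_stop = g * stop_alone + (1.0 - g) * E1
    cost_continue = g * C1 + (1.0 - g) * D1
    return cost_stop - cost_continue
```

With $D^{(1)} = -3$, $C^{(1)} = -2$, $\eta_1 = 1$, $\nu_1 = 0.5$ and $r_i = 2.5$ (a Case-2 configuration), $\kappa$ falls linearly from $0.25$ at $p = 0$ to $-0.5$ at $p = 1$, so once stopping is preferred at some weight it remains preferred at every larger weight.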


Appendix G 

Obtaining Cooperative Optimal Policy Pair 

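We will first prove Lemma 6.

Lemma 6: The policy pair $(\pi_1^\gamma, \pi_2^\gamma)$ is Pareto optimal, i.e., for any other policy pair $(\pi_1, \pi_2)$,

(1) if $C^{(1)}_{\pi_1,\pi_2} < C^{(1)}_{\pi_1^\gamma,\pi_2^\gamma}$ then $C^{(2)}_{\pi_1^\gamma,\pi_2^\gamma} < C^{(2)}_{\pi_1,\pi_2}$, and

(2) if $C^{(2)}_{\pi_1,\pi_2} < C^{(2)}_{\pi_1^\gamma,\pi_2^\gamma}$ then $C^{(1)}_{\pi_1^\gamma,\pi_2^\gamma} < C^{(1)}_{\pi_1,\pi_2}$.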

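Proof: We will prove Part-(1); the proof of Part-(2) is similar. Since $(\pi_1^\gamma, \pi_2^\gamma)$ is optimal for the problem in (40) we can
write, for any policy pair $(\pi_1, \pi_2)$,

$\gamma\, C^{(1)}_{\pi_1^\gamma,\pi_2^\gamma} + (1 - \gamma)\, C^{(2)}_{\pi_1^\gamma,\pi_2^\gamma} \le \gamma\, C^{(1)}_{\pi_1,\pi_2} + (1 - \gamma)\, C^{(2)}_{\pi_1,\pi_2}$,

rewriting which we obtain

$C^{(2)}_{\pi_1^\gamma,\pi_2^\gamma} \le \dfrac{\gamma}{1 - \gamma}\Big(C^{(1)}_{\pi_1,\pi_2} - C^{(1)}_{\pi_1^\gamma,\pi_2^\gamma}\Big) + C^{(2)}_{\pi_1,\pi_2}$.

Now, if $C^{(1)}_{\pi_1,\pi_2} < C^{(1)}_{\pi_1^\gamma,\pi_2^\gamma}$ then from the above expression we have $C^{(2)}_{\pi_1^\gamma,\pi_2^\gamma} < C^{(2)}_{\pi_1,\pi_2}$. ■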
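The argument can be illustrated on a finite set of candidate cost pairs: whichever pair minimizes the weighted sum cannot be dominated by another candidate. The sketch below is illustrative only; the function names and the list of cost pairs are hypothetical, and it assumes a multiplier $\gamma$ strictly between 0 and 1.

```python
def weighted_sum_minimizer(cost_pairs, gamma):
    """Return the cost pair (c1, c2) minimizing gamma * c1 + (1 - gamma) * c2."""
    return min(cost_pairs, key=lambda c: gamma * c[0] + (1.0 - gamma) * c[1])


def dominates(a, b):
    """True if pair a is no worse than b in both costs and differs from b."""
    return a[0] <= b[0] and a[1] <= b[1] and a != b


def is_pareto_optimal(candidate, cost_pairs):
    """True if no pair in cost_pairs dominates the candidate."""
    return not any(dominates(other, candidate) for other in cost_pairs)
```

For the hypothetical candidates `[(1, 5), (2, 2), (4, 1), (3, 3)]` and `gamma = 0.5`, the weighted sums are 3, 2, 2.5 and 3, so the minimizer is `(2, 2)`; any candidate with a smaller first cost has a larger second cost, as in Part-(1) of the lemma.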

We now proceed to obtain $\big(C^{(1)}_{\pi_1^\gamma,\pi_2^\gamma}, C^{(2)}_{\pi_1^\gamma,\pi_2^\gamma}\big)$ by formulating the problem in (40) as an MDP. The state space, action space
and the state transitions remain the same as in Section V. However, the one-step costs have to be appropriately modified to take
into account the multiplier $\gamma$. Without writing down all the details we will proceed to the Bellman equations. The one-step
costs will be evident from these.

For states of the form $(r_i, t)$,

$J^*(r_i, t) = \min\Big\{-\gamma\eta_1 r_i,\ \gamma\tau + \sum_{i'} p^{(1)}_{i'} J^*(r_{i'}, t)\Big\}$.    (70)


The first term in the min expression above corresponds to the cost of the joint-action (s, s) or (s, c) (since $\mathscr{F}_2$ has terminated,
its action is irrelevant), and the second term is the expected cost of choosing (c, s) (or (c, c)). Similarly, when the state is
$(t, r_j)$ we can write

$J^*(t, r_j) = \min\Big\{-(1 - \gamma)\eta_2 r_j,\ (1 - \gamma)\tau + \sum_{j'} p^{(2)}_{j'} J^*(t, r_{j'})\Big\}$.    (71)

The more interesting case is when both forwarders are still competing, i.e., when the state is of the form $(r_i, r_j)$, where the
optimality equation is

$J^*(r_i, r_j) = \min\big\{C_{s,c},\ C_{c,s},\ C_{s,s},\ C_{c,c}\big\}$.    (72)


$C_{a_1,a_2}$ is the expected cost (one-step + future cost-to-go) of choosing the joint-action $(a_1, a_2)$. When the joint-action chosen
is (s,c), since $\mathscr{F}_1$ stops and $\mathscr{F}_2$ continues, the one-step cost is $\big(-\gamma\eta_1 r_i + (1 - \gamma)\tau\big)$. The subsequent state is of the form
$(t, r_{j'})$ w.p. $p^{(2)}_{j'}$. Hence the expression for $C_{s,c}$ can be written as

$C_{s,c} = -\gamma\eta_1 r_i + (1 - \gamma)\tau + \sum_{j'} p^{(2)}_{j'} J^*(t, r_{j'})$.    (73)

Similarly $C_{c,s}$ can be written as

$C_{c,s} = \gamma\tau - (1 - \gamma)\eta_2 r_j + \sum_{i'} p^{(1)}_{i'} J^*(r_{i'}, t)$.    (74)


When both forwarders decide to stop then w.p. $\nu_1$, $\mathscr{F}_1$ gets the relay, in which case the cost incurred is $C_{s,c}$; otherwise, w.p.
$\nu_2$, $\mathscr{F}_2$ gets the relay and the cost incurred is $C_{c,s}$. Hence

$C_{s,s} = \nu_1 C_{s,c} + \nu_2 C_{c,s}$.    (75)

Finally, when both forwarders continue, the one-step cost is simply $\big(\gamma\tau + (1 - \gamma)\tau\big) = \tau$ and the subsequent state is still
of the form $(r_{i'}, r_{j'})$. Thus we can write

$C_{c,c} = \tau + \sum_{i',j'} p_{i',j'} J^*(r_{i'}, r_{j'})$.    (76)


From (75) note that $C_{s,s} \ge \min\{C_{s,c}, C_{c,s}\}$, since $C_{s,s}$ is a convex combination of the two; hence the joint-action (s, s) is never
strictly better than the better of (s, c) and (c, s) and can be dropped from the minimization. Thus, under
cooperation the forwarders never compete for a relay; either $\mathscr{F}_1$ will choose the relay, or $\mathscr{F}_2$ will choose, or both continue.
Expression (72) can therefore be simplified as

$J^*(r_i, r_j) = \min\big\{C_{s,c},\ C_{c,s},\ C_{c,c}\big\}$.    (77)

One can perform value iteration to solve for $J^*$ in (70), (71) and (77). Given $J^*$ it is easy to obtain the optimal policy
$(\pi_1^\gamma, \pi_2^\gamma)$ (simply choose the joint-action that minimizes the RHS of these expressions, breaking ties arbitrarily) and then the
cost pair, $\big(C^{(1)}_{\pi_1^\gamma,\pi_2^\gamma}, C^{(2)}_{\pi_1^\gamma,\pi_2^\gamma}\big)$.
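The value iteration mentioned above can be sketched as follows. This is an illustrative implementation under stated assumptions, not the authors' code: the function name, the zero initialization, the stopping tolerance and the example parameters are hypothetical; both forwarders' rewards are assumed to take values in the same array `r`; and `P[i, j]` is assumed to be the joint probability that the next relay offers `r[i]` to the first forwarder and `r[j]` to the second, with the marginals used in the single-forwarder states.

```python
import numpy as np


def cooperative_value_iteration(r, P, eta1, eta2, tau, gamma,
                                tol=1e-12, max_iters=100_000):
    """Iterate the Bellman equations for the gamma-weighted cooperative problem.

    Returns (J1, J2, J12): J1[i] for states (r_i, t), J2[j] for states (t, r_j),
    and J12[i, j] for states (r_i, r_j).
    """
    r = np.asarray(r, dtype=float)
    P = np.asarray(P, dtype=float)
    n = r.size
    p1, p2 = P.sum(axis=1), P.sum(axis=0)      # marginal reward distributions
    stop1 = -gamma * eta1 * r                  # weighted stopping cost, forwarder 1
    stop2 = -(1.0 - gamma) * eta2 * r          # weighted stopping cost, forwarder 2
    J1, J2, J12 = np.zeros(n), np.zeros(n), np.zeros((n, n))
    for _ in range(max_iters):
        cont1 = gamma * tau + p1 @ J1          # forwarder 1 continues alone
        cont2 = (1.0 - gamma) * tau + p2 @ J2  # forwarder 2 continues alone
        C_sc = stop1[:, None] + (1.0 - gamma) * tau + p2 @ J2   # shape (n, 1)
        C_cs = gamma * tau + stop2[None, :] + p1 @ J1           # shape (1, n)
        C_cc = tau + np.sum(P * J12)                            # scalar
        new1 = np.minimum(stop1, cont1)
        new2 = np.minimum(stop2, cont2)
        # (s, s) is omitted: it is a convex combination of (s, c) and (c, s).
        new12 = np.minimum(np.minimum(C_sc, C_cs), C_cc)
        delta = max(np.abs(new1 - J1).max(),
                    np.abs(new2 - J2).max(),
                    np.abs(new12 - J12).max())
        J1, J2, J12 = new1, new2, new12
        if delta < tol:
            break
    return J1, J2, J12
```

As a usage example, `cooperative_value_iteration([0.2, 0.5, 0.9], np.full((3, 3), 1/9), 1.0, 1.0, 0.1, 0.5)` treats a symmetric instance with uniformly distributed rewards. The optimal joint-action in each state is then the one attaining the minimum in the corresponding equation, with ties broken arbitrarily.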




